How can I process Chinese/Japanese characters with R [closed]

Submitted by 自闭症网瘾萝莉.ら on 2020-01-03 03:02:25

Question


I would like to use a tm-like package to split and identify non-English characters (mainly Japanese/Thai/Chinese) with R. What I would like to do is convert the text into some sort of matrix-like format and then run a Random Forest/logistic regression for text classification. Is there any way to do this with tm or another R package?
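
A minimal sketch of the kind of pipeline I have in mind ('texts' and 'labels' are hypothetical inputs; single characters stand in here as a crude substitute for proper word segmentation):

### Toy documents and class labels (hypothetical)
texts  <- c("你好世界", "世界很大", "hello world")
labels <- c(1, 1, 0)
### Split each document into individual characters (UTF-8 aware)
tokens <- strsplit(enc2utf8(texts), "")
### Build the vocabulary of distinct characters
vocab  <- sort(unique(unlist(tokens)))
### Build a document-by-character count matrix
dtm <- t(sapply(tokens, function(tk) table(factor(tk, levels = vocab))))
colnames(dtm) <- vocab
### 'dtm' could then be passed to a classifier, e.g.
### randomForest::randomForest(x = dtm, y = factor(labels))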


Answer 1:


EDIT:

It looks like R has a hard time reading non-English characters in as text. I tried scraping the Chinese alphabet from the web and got a result that may help, provided the character encoding is consistent.

### Load the package used to parse the HTML contents of a web page
require(XML)
### Open a connection to the page
con <- url('http://www.chinese-tools.com/characters/alphabet.html')
### Read in the content line by line, then close the connection
page <- readLines(con, encoding = "UTF-8")
close(con)
### Parse the HTML (the lines must be collapsed into a single string)
page <- htmlParse(paste(page, collapse = "\n"), asText = TRUE, encoding = "UTF-8")
### Extract a list of all tables on the page
page <- readHTMLTable(page)
### The alphabet is contained in the third table of the page
alphabet <- as.data.frame(page[[3]])

You now have a list of the characters (with their Latin-alphabet romanizations), alongside a column showing how those characters were read into R. If the characters in the object you wish to text-mine were read in the same way, it should be possible to use regular expressions to search for these encoded characters one at a time.
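
A minimal sketch of that idea, assuming the characters of interest sit in the first column of 'alphabet' and that 'docs' is a hypothetical character vector you want to search:

### Hypothetical sketch: count occurrences of each scraped character in each document
chars <- as.character(alphabet[[1]])
docs  <- c("你好世界", "世界很大")
counts <- sapply(chars, function(ch) {
  vapply(docs, function(d) {
    m <- gregexpr(ch, d, fixed = TRUE)[[1]]
    if (m[1] == -1) 0L else length(m)
  }, integer(1))
})
### 'counts' is a documents-by-characters matrix of match counts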



Source: https://stackoverflow.com/questions/16174561/how-can-i-process-chinese-japanese-characters-with-r
