Question
I would like to use a tm-like package to split and identify non-English characters (mainly Japanese/Thai/Chinese) in R. What I would like to do is convert the text into some sort of matrix-like format and then run a Random Forest or logistic regression for text classification. Is there any way to do this with tm or another R package?
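For reference, a minimal sketch of the pipeline the question is aiming at, under one big assumption: the CJK text has already been segmented into space-separated tokens (tm's default tokenizer splits on whitespace, which is exactly what plain Chinese/Japanese/Thai text lacks, and is the crux of the question). The documents and labels below are made up.
### Hypothetical end-goal pipeline: document-term matrix -> random forest.
### Assumes pre-segmented, space-delimited tokens.
library(tm)
library(randomForest)
### Toy pre-segmented documents and their class labels
docs   <- c("今日 は 晴れ", "明日 は 雨", "今日 も 雨")
labels <- factor(c("sun", "rain", "rain"))
### Build a corpus and a document-term matrix; lower the default minimum
### word length (3) so short CJK tokens are not silently dropped
corpus <- VCorpus(VectorSource(docs))
dtm    <- DocumentTermMatrix(corpus, control = list(wordLengths = c(1, Inf)))
### randomForest takes a dense matrix, so convert the sparse DTM
fit <- randomForest(x = as.matrix(dtm), y = labels)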
Answer 1:
EDIT:
It looks like R has a hard time reading non-English characters in as text. I tried scraping the Chinese alphabet from the web and got a result that may help, if the character encoding is consistent.
### Load the package used to parse the HTML content of a web page
require(XML)
### Open a connection to the page
con <- url('http://www.chinese-tools.com/characters/alphabet.html')
### Read in the content line by line, declaring the encoding up front
page <- readLines(con, encoding = "UTF-8")
close(con)
### Parse the HTML code (asText = TRUE because we pass raw HTML, not a file path)
doc <- htmlParse(paste(page, collapse = "\n"), asText = TRUE, encoding = "UTF-8")
### Extract every table on the page into a list of data frames
tables <- readHTMLTable(doc)
### The alphabet is contained in the third table of the page
alphabet <- as.data.frame(tables[3])
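As a quick sanity check (my addition, not part of the original answer), inspect what was scraped and how the strings came through:
### Inspect the structure and first few rows of the scraped table
str(alphabet)
head(alphabet)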
You now have a list of US-alphabet letters, with another column showing how the corresponding Chinese characters were read into R. If the characters in the original object you wish to text-mine were read in the same way, would it be possible to use regular expressions to search for these encoded characters one at a time?
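As an illustrative sketch of that regex idea (my addition, not from the original answer): in a UTF-8 locale you can match CJK and Thai characters directly by Unicode range with base R, which sidesteps looking characters up one at a time. The ranges below are \u4e00-\u9fff for CJK Unified Ideographs, \u3040-\u30ff for Japanese kana, and \u0e00-\u0e7f for Thai; the sample strings are made up.
### Match CJK/Thai characters by Unicode range
texts   <- c("R is great", "日本語のテキスト", "ภาษาไทย", "中文文本")
pattern <- "[\u4e00-\u9fff\u3040-\u30ff\u0e00-\u0e7f]"
### TRUE for strings containing at least one character from those blocks
has_cjk <- grepl(pattern, texts, perl = TRUE)
### Per-string counts of matching characters, usable as a crude feature
cjk_counts <- lengths(regmatches(texts, gregexpr(pattern, texts, perl = TRUE)))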
Source: https://stackoverflow.com/questions/16174561/how-can-i-process-chinese-japanese-characters-with-r