Using R for Text Mining Reuters-21578

Question

I am trying to work with the well-known Reuters-21578 dataset and am having trouble loading the .sgm files into my corpus.

Right now I am using the following commands in an attempt to load all the files into my corpus:

require(tm)
reut21578 <- system.file("reuters21578", package = "tm")
reuters <- Corpus(DirSource(reut21578),
    readerControl = list(reader = readReut21578XML))

However, this gives me the following error:

Error in DirSource(reut21578) : empty directory

Any idea where I may be going wrong?

Answer 1:

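The "tm" package includes only a small sample of the Reuters-21578 data, not the full collection. That is most likely why system.file("reuters21578", package = "tm") finds no such directory and DirSource() reports an empty one. If you want to avoid downloading, loading, and preparing all 22 Reuters-21578 files yourself, you can use the "tm.corpus.Reuters21578" package: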

install.packages("tm.corpus.Reuters21578", repos = "http://datacube.wu.ac.at")
library(tm.corpus.Reuters21578)
data(Reuters21578)
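
If the small sample bundled with "tm" is enough, the following is a minimal sketch of loading it directly. It assumes a reasonably recent "tm" release in which the sample files live under texts/crude, and that the XML parsing package the Reuters readers depend on is installed. The variable names missing_dir, crude_dir and crude are illustrative only.

```r
library(tm)

# The path used in the question. system.file() returns "" when no such
# directory exists in the installed package, so this check shows whether
# DirSource() would have anything to read.
missing_dir <- system.file("reuters21578", package = "tm")
nzchar(missing_dir)

# Assumption: the bundled Reuters-21578 sample (crude oil articles) is
# stored under texts/crude inside the installed tm package.
crude_dir <- system.file("texts", "crude", package = "tm")

# Read the sample XML files into a corpus, one plain-text document per file.
crude <- VCorpus(
    DirSource(crude_dir, mode = "binary"),
    readerControl = list(reader = readReut21578XMLasPlain)
)

# Number of documents that were loaded.
length(crude)
```

This only covers the bundled sample. For the complete collection, the "tm.corpus.Reuters21578" package shown above remains the simpler route.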

Source: https://stackoverflow.com/questions/20184541/using-r-for-text-mining-reuters-21578

Tags

corpus

reuters
