tm package error: Error defining Document Term Matrix

Question

I am analyzing the Reuters-21578 corpus, all the Reuters news articles from 1987, using the "tm" package in R. After importing the XML files into an R data file, I clean the text (convert to plain text, convert to lower case, remove stop words, and so on, as shown below). When I then try to convert the corpus to a document-term matrix, I receive this error message:

Error in UseMethod("Content", x) : no applicable method for 'Content' applied to an object of class "character"

All pre-processing steps run without error up to the document-term matrix step.

I created a non-random subset of the corpus (4,000 documents), and the DocumentTermMatrix command works fine on that subset.

My code is below. Thanks for the help.

## Import
library(tm)
file <- "reut-full.xml"
reuters <- Corpus(ReutersSource(file), readerControl = list(reader = readReut21578XML))

## Convert to Plain Text Documents
reuters <- tm_map(reuters, as.PlainTextDocument)

## Convert to Lower Case
reuters <- tm_map(reuters, tolower)

## Remove Stopwords
reuters <- tm_map(reuters, removeWords, stopwords("english"))

## Remove Punctuation
reuters <- tm_map(reuters, removePunctuation)

## Stemming
reuters <- tm_map(reuters, stemDocument)

## Remove Numbers
reuters <- tm_map(reuters, removeNumbers)

## Eliminate Extra Whitespace
reuters <- tm_map(reuters, stripWhitespace)

## Create a document-term matrix
dtm <- DocumentTermMatrix(reuters)

Error in UseMethod("Content", x) : 
  no applicable method for 'Content' applied to an object of class "character"

Source: https://stackoverflow.com/questions/10377273/tm-package-error-error-definining-document-term-matrix

Tags

text-analysis

reuters