DocumentTermMatrix error on Corpus argument

浪子不回头ぞ submitted on 2019-11-27 10:33:34

It seems this would have worked just fine in tm 0.5.10, but changes in tm 0.6.0 seem to have broken it. The problem is that functions such as tolower and trim won't necessarily return TextDocuments (it looks like the older version may have done the conversion automatically). They return character vectors instead, and DocumentTermMatrix isn't sure how to handle a corpus of character vectors.
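A minimal sketch of the difference, assuming tm 0.6.0 or later is installed. The document text and the variable names (doc, raw_result, wrapped_result) are made up for illustration:

```r
library(tm)

# Hypothetical single document, invented for this illustration.
doc <- PlainTextDocument("Hello World")

# Applying tolower to the text itself yields a bare character vector.
raw_result <- tolower(content(doc))
is.character(raw_result)

# content_transformer() builds a function that modifies the content
# in place and hands back the document object itself.
wrapped_result <- content_transformer(tolower)(doc)
inherits(wrapped_result, "PlainTextDocument")
content(wrapped_result)
```

The wrapped version keeps the document class, which is what DocumentTermMatrix needs downstream.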

So you could change to

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

Or you can run

corpus_clean <- tm_map(corpus_clean, PlainTextDocument)

after all of your non-standard transformations (those not in getTransformations()) are done and just before you create the DocumentTermMatrix. That should make sure all of your documents are PlainTextDocument objects and should make DocumentTermMatrix happy.
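Put together, the first option might look like the sketch below. It assumes tm 0.6.0 or later; the three texts are invented, and news_corpus is built from them here only so the example runs on its own:

```r
library(tm)

# Hypothetical three-document corpus; the texts are invented.
texts <- c("Stocks RALLY as markets open.",
           "  Markets close lower after a volatile day  ",
           "Rain expected; markets unaffected.")
news_corpus <- VCorpus(VectorSource(texts))

# tolower is a base R function, so wrap it in content_transformer().
corpus_clean <- tm_map(news_corpus, content_transformer(tolower))

# Transformations listed by getTransformations() can be passed directly.
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)

# Every element is still a text document, so this call succeeds.
dtm <- DocumentTermMatrix(corpus_clean)
inspect(dtm)
```

If a custom function has already been applied without the wrapper, the tm_map(corpus_clean, PlainTextDocument) call described above is the alternative to try.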

Rodrigo Araujo

I have found a way to solve this problem in an article about TM.

An example that produces the error follows:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1") # import files
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
matrix_terms <- DocumentTermMatrix(corpus)

Warning messages:

In TermDocumentMatrix.VCorpus(x, control) : invalid document identifiers

This warning appears because DocumentTermMatrix expects a corpus whose elements are text documents, but transformations applied without content_transformer can turn the documents into plain character vectors, a class the function does not accept.

However, if you wrap each function in content_transformer inside the tm_map call, you may not even need an extra command before calling DocumentTermMatrix.

The code below changes the class (see the second-to-last line) and avoids the error:

getwd()
require(tm)
files <- DirSource(directory="texts/", encoding="latin1")
corpus <- VCorpus(x=files) # load files, create corpus

summary(corpus) # get a summary
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- tm_map(corpus,content_transformer(stripWhitespace))
corpus <- tm_map(corpus,content_transformer(removePunctuation))
corpus <- Corpus(VectorSource(corpus)) # change class 
matrix_term <- DocumentTermMatrix(corpus)
Renmelcon

Change this:

corpus_clean <- tm_map(news_corpus, tolower)

To this:

corpus_clean <- tm_map(news_corpus, content_transformer(tolower))
gopal

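Downgrading to tm 0.5-10 should work.

# remove.packages() takes the package name as a string
remove.packages("tm")
# install the tm 0.5-10 Windows binary built for R 3.0
install.packages("http://cran.r-project.org/bin/windows/contrib/3.0/tm_0.5-10.zip",repos=NULL)
library(tm)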
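Before downgrading, it may help to confirm which version is actually installed. A small sketch, where installed_version and needs_wrapper are names chosen for this illustration:

```r
# packageVersion() reports the installed version without loading the package.
installed_version <- packageVersion("tm")

# The behaviour change discussed in this thread arrived with tm 0.6.0.
needs_wrapper <- installed_version >= "0.6"

if (needs_wrapper) {
  message("tm ", installed_version,
          ": wrap base functions in content_transformer()")
} else {
  message("tm ", installed_version,
          ": the older tm_map behaviour applies")
}
```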