In R tm package, build corpus FROM Document-Term-Matrix

Question

It's straightforward to build a document-term matrix from a corpus with the tm package. I'd like to build a corpus from a document-term-matrix.

Let M be the number of documents in a document set and V the number of terms in the vocabulary of that document set. Then a document-term matrix is an M × V matrix.

I also have a vocabulary vector of length V. It holds the words that the column indices of the document-term matrix refer to.

From the dtm and vocabulary vector, I'd like to construct a "corpus" object. This is because I'd like to stem my document set. I built my dtm and vocab manually - i.e. there never was a tm "corpus" object representing my dataset, so I can't use the function,

tm_map(corpus, stemDocument, language="english")

I've been trying to build a workaround where I stem the vocabulary and only keep unique words, but then it gets somewhat complicated trying to maintain the correspondence between the dtm and the vocabulary vector.

Ideally, the end result would be that my vocabulary vector is stemmed and only contains unique entries, and the dtm indices correspond to the stemmed vocabulary vector. If you can think of some other way to do that, I would appreciate that as well.
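One way to get that result without a corpus at all is to stem the vocabulary and then add together the dtm columns whose terms share a stem. The following is a minimal sketch, not a tested solution for the asker's data: it assumes the dtm is an ordinary numeric matrix of counts with one column per vocabulary entry (a sparse or tm matrix would first need `as.matrix()`), and the toy `vocab` and `dtm` values are hypothetical. It uses `wordStem()` from the SnowballC package.

```r
library(SnowballC)  # provides wordStem()

# Hypothetical toy inputs: 2 documents, 5 vocabulary terms
vocab <- c("run", "running", "runs", "cat", "cats")
dtm <- matrix(c(1, 2, 0, 1, 0,
                0, 1, 3, 0, 2),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("doc1", "doc2"), vocab))

# Stem every vocabulary entry; several terms may map to the same stem
stems <- wordStem(vocab, language = "english")

# rowsum() sums rows that share a group label, so transpose,
# group the term rows by stem, and transpose back to documents x stems
stemmed_dtm <- t(rowsum(t(dtm), group = stems))

# The collapsed column names are the unique, stemmed vocabulary
stemmed_vocab <- colnames(stemmed_dtm)
```

Because `rowsum()` sorts its groups, the stemmed vocabulary comes back in alphabetical order rather than in the original order, but it always matches the columns of `stemmed_dtm`.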

My troubles would be fixed if I could simply build a tm "corpus" from my dtm and vocabulary vector, stem the corpus, and then convert back to a dtm and vocabulary vector (I already know how to make those conversions).

Let me know if I can clarify the problem any further.

Answer 1:

Here's one approach, using my own minimal reproducible example built from the tm package (as a new user you may not be aware that providing one is your responsibility):

## Minimal Reproducible Example
library(tm)
data("crude")
dtm <- DocumentTermMatrix(crude,
    control = list(
        weighting = function(x) weightTfIdf(x, normalize = FALSE),
        stopwords = TRUE))

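## Convert dtm to a character vector with one text per document
## (rep() truncates non-integer weights toward zero, so a dtm of raw
## term counts is reproduced most faithfully)
dtm2list <- apply(dtm, 1, function(x) {
    paste(rep(names(x), x), collapse=" ")
})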

## convert to a Corpus
myCorp <- VCorpus(VectorSource(dtm2list))
inspect(myCorp)

## Stemming
myCorp <- tm_map(myCorp, stemDocument)
inspect(myCorp)
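To close the loop, the stemmed corpus can be turned back into a dtm and a vocabulary vector. This is a rough sketch on hypothetical toy texts (standing in for the strings that `dtm2list` would hold), assuming the tm and SnowballC packages are installed. Note that `DocumentTermMatrix()` by default drops terms shorter than three characters, so very short stems disappear unless `wordLengths` is changed in `control`.

```r
library(tm)

# Hypothetical texts, as they might come out of the dtm-to-text step
texts <- c("run running running cat",
           "running runs runs runs cats cats")

# Build the corpus and stem it
corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, stemDocument, language = "english")

# Back to a document-term matrix of plain term counts
stemmed_dtm <- DocumentTermMatrix(corp)

# The stemmed vocabulary, in the same order as the dtm columns
stemmed_vocab <- Terms(stemmed_dtm)

# Dense counts, convenient for checking the result by eye
counts <- as.matrix(stemmed_dtm)
```

Here `stemmed_vocab` and the columns of `counts` stay aligned by construction, which is the correspondence the question asks for.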

Source: https://stackoverflow.com/questions/24418893/in-r-tm-package-build-corpus-from-document-term-matrix

Tags

text-mining

corpus

lda