Trying to remove words from a DocumentTermMatrix in order to use topicmodels

自作多情 submitted on 2019-12-06 05:56:15

Question


I am trying to use the topicmodels package for R (100 topics on a corpus of ~6400 documents of ~1000 words each). The process runs and then dies, I think because it is running out of memory.

So I am trying to shrink the document term matrix that the LDA() function takes as input. I figured I could do that with the minDocFreq control option when I generate the document term matrix, but using it doesn't seem to make any difference. Here is the relevant bit of code:

> corpus <- Corpus(DirSource('./chunks/'),fileEncoding='utf-8')
> dtm <- DocumentTermMatrix(corpus)
> dim(dtm)
[1]  6423 41613
# So, I assume this next command will make my document term matrix smaller, i.e.
# fewer columns. I've chosen a larger number, 100, to illustrate the point.
> smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))
> dim(smaller)
[1]  6423 41613

Same dimensions, and same number of columns (that is, same number of terms).

Any sense what I'm doing wrong? Thanks.


Answer 1:


The answer to your question is over here: https://stackoverflow.com/a/13370840/1036500 (give it an upvote!)

In brief, more recent versions of the tm package no longer support minDocFreq; use the bounds control option instead. For example, your

smaller <- DocumentTermMatrix(corpus, control=list(minDocFreq=100))

should now be written with bounds, as in this example on the crude corpus that ships with tm:

require(tm)
data("crude")

smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(5,Inf))))
dim(smaller) # after terms that appear in fewer than 5 documents are discarded
[1] 20 67
smaller <- DocumentTermMatrix(crude, control=list(bounds = list(global = c(10,Inf))))
dim(smaller) # after terms that appear in fewer than 10 documents are discarded
[1] 20 17
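
Since the goal is to pass the trimmed matrix to topicmodels, one further step may be worth sketching. Dropping rare terms can leave some documents with no terms at all, and LDA() rejects input rows that contain no non-zero entries. The snippet below is a minimal sketch on the crude example corpus, not on the corpus from the question; the variable names full, trimmed and nonempty are illustrative, and it assumes the slam package (a dependency of tm) is available for row_sums().

```r
library(tm)    # DocumentTermMatrix() and the "crude" example corpus
library(slam)  # row_sums() for sparse matrices; assumed installed alongside tm

data("crude")

# Untrimmed matrix, for comparison.
full <- DocumentTermMatrix(crude)

# Keep only terms that appear in at least 5 documents.
trimmed <- DocumentTermMatrix(crude,
                              control = list(bounds = list(global = c(5, Inf))))

# Trimming removes columns (terms); the number of rows (documents) is unchanged.
dim(full)
dim(trimmed)

# Drop documents that no longer contain any of the remaining terms,
# so that every row has at least one non-zero entry.
nonempty <- trimmed[slam::row_sums(trimmed) > 0, ]
dim(nonempty)
```

On a real corpus the lower bound would need tuning: a higher bound gives fewer columns and less memory use, but also discards more vocabulary.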


Source: https://stackoverflow.com/questions/16287546/trying-to-remove-words-from-a-documenttermmatrix-in-order-to-use-topicmodels
