How can I cluster thousands of documents using the R tm package?

Submitted 2019-12-09 23:14:27

Question


I have about 25,000 documents that need to be clustered, and I was hoping to use the R tm package. Unfortunately, I run out of memory at around 20,000 documents. The following function shows what I am trying to do, using dummy data; it exhausts memory when called with n = 20 on a Windows machine with 16GB of RAM. Are there any optimizations I could make?

Thank you for any help.

make_clusters <- function(n) {
    require(tm)
    require(slam)
    # Dummy data: n distinct single-letter "documents", each repeated 1000 times
    docs <- unlist(lapply(letters[1:n], function(x) rep(x, 1000)))
    # TF-IDF weighted term-document matrix (terms in rows, documents in columns)
    tdf <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                              control = list(weighting = weightTfIdf,
                                             wordLengths = c(1, Inf)))
    # Cosine similarity between all pairs of documents
    tdf.norm <- col_norms(tdf)
    docs.simil <- crossprod_simple_triplet_matrix(tdf, tdf) / outer(tdf.norm, tdf.norm)
    # Hierarchical clustering on cosine distance
    hh <- hclust(as.dist(1 - docs.simil))
}
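One likely source of the memory pressure is that the code above materializes two dense n-by-n matrices at once: the raw cross-product and the `outer()` matrix of norm products. A possible sketch of a lighter variant (an assumption on my part, not from the question) is to normalize the sparse columns in place first, so the cosine similarity is a single cross-product and `outer()` is never allocated. This relies on the fact that a `TermDocumentMatrix` is a slam `simple_triplet_matrix` with `i`, `j`, `v` slots:

```r
library(tm)
library(slam)

# Small dummy corpus in the same spirit as the question:
# 50 one-letter documents drawn from 5 distinct letters.
docs <- unlist(lapply(letters[1:5], function(x) rep(x, 10)))
tdm <- TermDocumentMatrix(Corpus(VectorSource(docs)),
                          control = list(weighting = weightTfIdf,
                                         wordLengths = c(1, Inf)))

# Divide each nonzero entry by the norm of its document (column)
# while the matrix is still sparse, instead of using outer() later.
norms <- col_norms(tdm)
tdm$v <- tdm$v / norms[tdm$j]

# Now a single cross-product yields the cosine similarity matrix,
# so only one dense n-by-n matrix is ever allocated.
simil <- crossprod_simple_triplet_matrix(tdm, tdm)

hh <- hclust(as.dist(1 - simil))
```

For 25,000 documents the dense similarity matrix alone is roughly 25000^2 * 8 bytes, about 5GB, so halving the number of dense intermediates may be what makes the difference on a 16GB machine.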

Source: https://stackoverflow.com/questions/26149125/how-can-i-cluster-thousands-of-documents-using-the-r-tm-package
