Converting a Document Term Matrix into a Matrix with lots of data causes overflow

蹲街弑〆低调 提交于 2019-11-30 05:11:34

Integer overflow tells you exactly what the problem is : with 40000 documents, you have too much data. It is in the conversion to a matrix that the problem begins btw, which can be seen if you look at the code of the underlying function :

class(dtm)
[1] "TermDocumentMatrix"    "simple_triplet_matrix"

getAnywhere(as.matrix.simple_triplet_matrix)

A single object matching ‘as.matrix.simple_triplet_matrix’ was found
...
function (x, ...) 
{
    nr <- x$nrow
    nc <- x$ncol
    y <- matrix(vector(typeof(x$v), nr * nc), nr, nc)
   ...
}

This is the line referenced by the error message. What's going on, can be easily simulated by :

as.integer(40000 * 60000) # 40000 documents is 40000 rows in the resulting frame
[1] NA
Warning message:
NAs introduced by coercion 

The function vector() takes an argument with the length, in this case nr*nc If this is larger than appx. 2e9 ( .Machine$integer.max ), it will be replaced by NA. This NA is not valid as an argument for vector().

Bottomline : You're running into the limits of R. As for now, working in 64bit won't help you. You'll have to resort to different methods. One possibility would be to continue working with the list you have (dtm is a list), selecting the data you need using list manipulation and go from there.

PS : I made a dtm object by

require(tm)
data("crude")
dtm <- TermDocumentMatrix(crude,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))

Here is a very very simple solution I discovered recently

DTM=t(TDM)#taking the transpose of Term-Document Matrix though not necessary but I prefer DTM over TDM
M=as.big.matrix(x=as.matrix(DTM))#convert the DTM into a bigmemory object using the bigmemory package 
M=as.matrix(M)#convert the bigmemory object again to a regular matrix
M=t(M)#take the transpose again to get TDM

Please note that taking transpose of TDM to get DTM is absolutely optional, it's my personal preference to play with matrices this way

P.S.Could not answer the question 4 years back as I was just a fresh entry in my college

Based on Joris Meys answer, I've found the solution. "vector()" documentation regarding "length" argument

... For a long vector, i.e., length > .Machine$integer.max, it has to be of type "double"...

So we can make a tiny fix of the as.matrix():

as.big.matrix <- function(x) {
  nr <- x$nrow
  nc <- x$ncol
  # nr and nc are integers. 1 is double. Double * integer -> double
  y <- matrix(vector(typeof(x$v), 1 * nr * nc), nr, nc)
  y[cbind(x$i, x$j)] <- x$v
  dimnames(y) <- x$dimnames
  y
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!