Question
I have a TermDocumentMatrix created using the tm package in R.
I'm trying to create a matrix/dataframe that has the 50 most frequently occurring terms.
When I try to convert to a matrix I get this error:
> ap.m <- as.matrix(mydata.dtm)
Error: cannot allocate vector of size 2.0 Gb
So I tried converting to sparse matrices using Matrix package:
> A <- as(mydata.dtm, "sparseMatrix")
Error in as(from, "CsparseMatrix") :
no method or default for coercing "TermDocumentMatrix" to "CsparseMatrix"
> B <- Matrix(mydata.dtm, sparse = TRUE)
Error in asMethod(object) : invalid class 'NA' to dup_mMatrix_as_geMatrix
I've tried accessing the different parts of the tdm using:
> freqy1 <- data.frame(term1 = findFreqTerms(mydata.dtm, lowfreq=165))
> mydata.dtm[mydata.dtm$ Terms %in% freqy1$term1,]
Error in seq_len(nr) : argument must be coercible to non-negative integer
Here's some other info:
> str(mydata.dtm)
List of 6
$ i : int [1:430206] 377 468 725 3067 3906 4150 4393 5188 5793 6665 ...
$ j : int [1:430206] 1 1 1 1 1 1 1 1 1 1 ...
$ v : num [1:430206] 1 1 1 1 1 1 1 1 2 3 ...
$ nrow : int 15643
$ ncol : int 17207
$ dimnames:List of 2
..$ Terms: chr [1:15643] "000" "0mm" "100" "1000" ...
..$ Docs : chr [1:17207] "1" "2" "3" "4" ...
- attr(*, "class")= chr [1:2] "TermDocumentMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
> mydata.dtm
A term-document matrix (15643 terms, 17207 documents)
Non-/sparse entries: 430206/268738895
Sparsity : 100%
Maximal term length: 54
Weighting : term frequency (tf)
My ideal output is something like this:
term frequency
the 2123
and 2095
able 883
... ...
Any suggestions?
Answer 1:
Term-document matrices in tm are already stored as sparse matrices (in triplet form). Here, mydata.dtm$i and mydata.dtm$j are the row and column indices of the non-zero entries, and mydata.dtm$v is the corresponding vector of frequencies. So you can create a sparse matrix with the Matrix package by writing:
sparseMatrix(i=mydata.dtm$i, j=mydata.dtm$j, x=mydata.dtm$v)
Then you can use rowSums to get the total frequency of each term, and match the rows you are interested in to the terms they stand for with mydata.dtm$dimnames$Terms.
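The steps above can be sketched end to end as follows. This is only a minimal illustration, not a definitive implementation: toy.tdm is a made-up stand-in with the same fields that str() shows for the TermDocumentMatrix in the question, and top_terms is a hypothetical helper name, not part of tm or Matrix.

```r
library(Matrix)

# Toy stand-in (assumption): a plain list with the same fields as the
# TermDocumentMatrix in the question (i, j, v, nrow, ncol, dimnames).
# The values are invented for illustration.
toy.tdm <- list(
  i = c(1L, 2L, 3L, 1L, 2L, 1L),   # term (row) index of each non-zero entry
  j = c(1L, 1L, 1L, 2L, 2L, 3L),   # document (column) index of each entry
  v = c(2, 1, 1, 3, 1, 4),         # frequency of the term in that document
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Terms = c("the", "and", "able"),
                  Docs  = c("1", "2", "3"))
)

# Hypothetical helper: data frame of the n most frequent terms.
top_terms <- function(tdm, n = 50) {
  # Build the sparse matrix from the triplets; passing dims keeps the
  # shape correct even if the last rows or columns are empty.
  m <- sparseMatrix(i = tdm$i, j = tdm$j, x = tdm$v,
                    dims = c(tdm$nrow, tdm$ncol))
  # Total frequency of each term across all documents.
  freq <- rowSums(m)
  names(freq) <- tdm$dimnames$Terms
  freq <- sort(freq, decreasing = TRUE)
  # Keep at most n rows, most frequent first.
  keep <- seq_len(min(n, length(freq)))
  data.frame(term = names(freq)[keep],
             frequency = unname(freq)[keep],
             stringsAsFactors = FALSE)
}

top_terms(toy.tdm, n = 2)
```

With the real object, the call would presumably be top_terms(mydata.dtm, n = 50), since the function only touches the list fields shown in the str() output. Because the matrix stays sparse throughout, this avoids the dense allocation that made as.matrix fail.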
Source: https://stackoverflow.com/questions/11508728/r-tm-package-create-matrix-of-nmost-frequent-terms