Question
I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I cannot figure out how to access them.
Ideally, I would like:
Term #
"the" 200
"is" 400
"a" 200
Currently my code is:
library(tm)
common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
x <- Corpus(VectorSource(results))
x <- tm_map(x, stripWhitespace)
x <- tm_map(x, removeNumbers)
x <- tm_map(x, removePunctuation)
x <- tm_map(x, stripWhitespace)
dtm <- DocumentTermMatrix(x)
for (i in seq_along(common.words)) {
  dtm <- dtm[, !colnames(dtm) %in% common.words[i]]
}
This is the output from str(dtm):
List of 6
$ i : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
$ j : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
$ v : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
$ nrow : int 1477
$ ncol : int 3201
$ dimnames:List of 2
..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "Weighting")= chr [1:2] "term frequency" "tf"
Thank you,
-A
Answer 1:
It appears to be a sparse-matrix representation of the data. The frequencies appear to be stored in the "v" element, with "i" and "j" giving the document (row) and term (column) index of each entry, so you can find a term's counts by looking up its position in the Terms dimnames. Why not provide dput(head(results, 30)) so your code (and your SO audience) will have something to work on? After playing around with the examples in the package, I suspect you actually want something along the lines of:
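tdm <- TermDocumentMatrix(x)
# as.matrix() returns the dense counts; inspect() is mainly meant for printing
# The terms must exist in tdm: tm's default wordLengths drops terms under 3 characters
z <- as.matrix(tdm[c("the", "is", "a"), ])
rowSums(z)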
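To make the triplet layout concrete, here is a minimal base R sketch that sums the "v" values per term index "j". The list dtm_like and its values are made up for illustration; it only mimics the fields shown in the str(dtm) output and is not a real DocumentTermMatrix.

```r
# Toy stand-in for the triplet fields shown in str(dtm); all values are invented.
dtm_like <- list(
  i = c(1L, 1L, 2L, 2L, 3L),   # document (row) index of each non-zero cell
  j = c(1L, 2L, 1L, 3L, 2L),   # term (column) index of each non-zero cell
  v = c(2, 1, 1, 4, 3),        # count stored in that cell
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs  = c("1", "2", "3"),
                  Terms = c("aardvark", "aaron", "abbie"))
)

# Sum the counts per column index, then label the totals with the term names.
term_freq <- numeric(dtm_like$ncol)            # terms with no entries stay at 0
totals <- tapply(dtm_like$v, dtm_like$j, sum)  # one total per distinct j
term_freq[as.integer(names(totals))] <- totals
names(term_freq) <- dtm_like$dimnames$Terms

print(term_freq)
```

On a real DocumentTermMatrix the same totals should be available without densifying via slam::col_sums(dtm), assuming the slam package (which tm depends on) is installed.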
Answer 2:
I had the same problem and found what I think is a simpler way:
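library(dplyr)  # provides %>% (and arrange()/desc() used further down)
num <- 10  # number of terms to show
# Note: findFreqTerms() does not sort by frequency, so [1:num] takes the first num terms it returns
tdm[findFreqTerms(tdm)[1:num], ] %>%
  as.matrix() %>%
  rowSums()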
Printing in columns is trickier (I'm sure someone has a much better way than this):
terms <- findFreqTerms(tdm)[1:num]
tdm[terms, ] %>%
  as.matrix() %>%
  rowSums() %>%
  data.frame(Term = terms, Frequency = .) %>%
  arrange(desc(Frequency))
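For comparison, a base R sketch of the same two-column output that sorts by total count before taking the top rows, so the result is ordered by frequency. The matrix m and its term names are invented stand-ins; with real data it would come from something like as.matrix(tdm).

```r
# Toy term-by-document counts; all values are invented for illustration.
m <- matrix(c(2, 0, 1,
              0, 3, 1,
              5, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(Terms = c("apple", "banana", "cherry"),
                            Docs  = c("1", "2", "3")))

# Total count per term, largest first.
freq <- sort(rowSums(m), decreasing = TRUE)

# Two-column table: one row per term, already ordered by frequency.
top <- data.frame(Term = names(freq),
                  Frequency = unname(freq),
                  row.names = NULL)

print(head(top, 2))  # keep only the two most frequent terms
```

Sorting first and then calling head() avoids relying on the order in which terms happen to be stored in the matrix.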
Source: https://stackoverflow.com/questions/14426925/frequency-per-term-r-tm-documenttermmatrix