Frequency Per Term - R TM DocumentTermMatrix

ⅰ亾dé卋堺 submitted on 2020-01-13 11:33:15

Question


I'm very new to R and cannot quite wrap my head around DocumentTermMatrix objects. I have a DocumentTermMatrix created with the tm package. It contains the terms and their frequencies, but I cannot figure out how to access them.

Ideally, I would like:

    Term  # 
    "the" 200 
    "is"  400 
    "a"   200 

Currently my code is:

    library(tm)
    common.words <- c("amp","@RT","I","http","https", stopwords("english"), "you")
    x <- Corpus(VectorSource(results)) 
    x <- tm_map(x, stripWhitespace) 
    x <- tm_map(x, removeNumbers) 
    x <- tm_map(x, removePunctuation) 
    x <- tm_map(x, stripWhitespace)

    dtm <- DocumentTermMatrix(x)
    # Drop every column whose term is in common.words (%in% is vectorised, so no loop is needed)
    dtm <- dtm[, !colnames(dtm) %in% common.words]

This is the output from str(dtm)

    List of 6
     $ i       : int [1:9769] 1 1 1 1 1 1 1 1 2 2 ...
     $ j       : int [1:9769] 1596 1684 1858 2112 2175 2490 2714 2814 873 961 ...
     $ v       : num [1:9769] 1 1 2 1 1 2 1 1 1 1 ...
     $ nrow    : int 1477
     $ ncol    : int 3201
     $ dimnames:List of 2
      ..$ Docs : chr [1:1477] "1" "2" "3" "4" ...
      ..$ Terms: chr [1:3201] "\u0093\u0085a" "aardvark" "aaron" "abbie" ...
     - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
     - attr(*, "Weighting")= chr [1:2] "term frequency" "tf"

Thank you,

-A


Answer 1:


It appears to be a sparse-matrix representation of the data. The frequencies appear to be in the "v" element, and you can match each one to its term by using the corresponding "j" value as an index into the Terms dimnames. Why not provide dput(head(results, 30)) so your code (and your SO audience) will have something to work on? After playing around with the examples in the package, I suspect you actually want something along the lines of:

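    tdm <- TermDocumentMatrix(x)
    # Keep only terms that are present; by default tm drops terms shorter than 3 characters
    wanted <- intersect(c("the", "is", "a"), Terms(tdm))
    z <- as.matrix(tdm[wanted, ])
    rowSums(z)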
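
To make the triplet layout concrete, here is a minimal base R sketch that needs no packages. The `toy` list is a made-up stand-in for the i/j/v/dimnames fields shown in the str() output, not real data from the question; the point is only that grouping "v" by "j" gives one total per term.

```r
# Toy stand-in for the fields shown by str(dtm); all values are invented.
toy <- list(
  i = c(1L, 1L, 2L, 2L, 3L),   # document (row) index of each non-zero cell
  j = c(1L, 3L, 1L, 2L, 3L),   # term (column) index of each non-zero cell
  v = c(2, 1, 1, 4, 3),        # count stored in that cell
  nrow = 3L,
  ncol = 3L,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("aardvark", "aaron", "abbie"))
)

# Sum the counts per term by grouping v on the column index j.
# Using a factor with all column levels keeps terms that have no entries.
groups <- factor(toy$j, levels = seq_len(toy$ncol))
term_freq <- as.vector(tapply(toy$v, groups, sum))
term_freq[is.na(term_freq)] <- 0   # terms without any non-zero cell
names(term_freq) <- toy$dimnames$Terms

term_freq
```

On a real DocumentTermMatrix, `slam::col_sums(dtm)` should give the same per-term totals without converting to a dense matrix (slam is the sparse-matrix package that tm builds on).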



Answer 2:


I had the same problem and found what I think is a simpler way:

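    library(magrittr) # provides %>%
    num <- 10 # Show this many top frequent terms

    # findFreqTerms() is not sorted by frequency, so sort the row sums instead
    tdm %>%
          as.matrix() %>%
          rowSums() %>%
          sort(decreasing = TRUE) %>%
          head(num)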

Printing in columns is trickier (I'm sure someone has a much better way than this):

    library(dplyr) # provides %>%, arrange() and desc()

    freq <- tdm %>%
          as.matrix() %>%
          rowSums()
    data.frame(Term = names(freq), Frequency = freq, row.names = NULL) %>%
          arrange(desc(Frequency)) %>%
          head(num)
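
For readers who would rather avoid the pipe, the same Term/Frequency table can be sketched in base R. The `freq` vector below is hypothetical: it stands in for the result of something like rowSums(as.matrix(tdm)), and its values simply mirror the illustrative table in the question.

```r
# Hypothetical named frequency vector (stand-in for rowSums(as.matrix(tdm))).
freq <- c(the = 200, is = 400, a = 200)

# Turn the named vector into a two-column table.
freq_table <- data.frame(Term = names(freq),
                         Frequency = unname(freq),
                         stringsAsFactors = FALSE)

# Sort by descending frequency, breaking ties alphabetically by term.
freq_table <- freq_table[order(-freq_table$Frequency, freq_table$Term), ]
rownames(freq_table) <- NULL

freq_table
```

Wrapping the last line in head(freq_table, num) would limit the output to the top `num` terms.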


Source: https://stackoverflow.com/questions/14426925/frequency-per-term-r-tm-documenttermmatrix
