Counter ngram with tm package in R

问题

I created a script for the frequency of words in a document using the object and a dictionary documentTermMatrix in R. The script works on individual words and not on the compound word es. "foo" "bar" "foo bar"

This is the code

require(tm)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
inspect(DocumentTermMatrix(myCorpus,list(dictionary = c("foo","bar","foo bar"))))

But the result is

Terms

Docs bar foo  foo bar

   1   1   1        0

I would have to find one "foo bar" = 1

How can I fix this?

回答1:

The problem is that DocummentTermMatrix(...) is tokenizing at word breaks be default. You need at least bigrams.

Credit to this post for the basic approach.

library(tm)
library(RWeka)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
myDict   <- c("foo","bar","foo bar")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(myCorpus, control=list(tokenize=BigramTokenizer,
                                                  dictionary=myDict)))
# <<DocumentTermMatrix (documents: 1, terms: 3)>>
# ...
#     Terms
# Docs bar foo foo bar
#    1   1   1       1

来源：https://stackoverflow.com/questions/26764187/counter-ngram-with-tm-package-in-r

标签

dictionary

frequency

text-mining

易学教程内所有资源均来自网络或用户发布的内容，如有违反法律规定的内容欢迎反馈！
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!