Question
I wrote a script that counts word frequencies in a document using a DocumentTermMatrix with a dictionary in R. The script works for individual words but not for compound terms, e.g. "foo", "bar", "foo bar".
This is the code:
require(tm)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
inspect(DocumentTermMatrix(myCorpus,list(dictionary = c("foo","bar","foo bar"))))
But the result is:
    Terms
Docs bar foo foo bar
   1   1   1       0
I expected the count for "foo bar" to be 1.
How can I fix this?
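Answer 1:
The problem is that DocumentTermMatrix(...) tokenizes at word breaks by default, so the two-word term "foo bar" is never produced as a token and cannot match the dictionary. You need a tokenizer that emits at least bigrams.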
Credit to this post for the basic approach.
library(tm)
library(RWeka)
my.docs <- c("foo bar word1 word2")
myCorpus <- Corpus(VectorSource(my.docs))
myDict <- c("foo","bar","foo bar")
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
inspect(DocumentTermMatrix(myCorpus, control = list(tokenize = BigramTokenizer,
                                                    dictionary = myDict)))
# <<DocumentTermMatrix (documents: 1, terms: 3)>>
# ...
#     Terms
# Docs bar foo foo bar
#    1   1   1       1
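To see why emitting both unigrams and bigrams makes "foo bar" countable, here is a minimal base R sketch that needs neither tm nor RWeka. The helper name ngram_tokens and its parameters n_min and n_max are hypothetical and only for illustration; this is not how NGramTokenizer is implemented, and it assumes whitespace-separated words.

```r
# Hypothetical helper (not part of tm or RWeka): return every n-gram of
# length n_min..n_max from a single whitespace-separated string.
ngram_tokens <- function(x, n_min = 1, n_max = 2) {
  words <- strsplit(trimws(x), "[[:space:]]+")[[1]]
  out <- character(0)
  for (n in n_min:n_max) {
    if (length(words) < n) next  # too few words for an n-gram of this size
    starts <- seq_len(length(words) - n + 1)
    grams <- vapply(starts,
                    function(i) paste(words[i:(i + n - 1)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

my_dict <- c("foo", "bar", "foo bar")
tokens <- ngram_tokens("foo bar word1 word2")

# Count only the dictionary terms; tokens outside the dictionary are dropped.
counts <- table(factor(tokens, levels = my_dict))
print(counts)
```

With n_max = 1 the token list contains only single words, so "foo bar" would count 0, which matches the behaviour described in the question.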
Source: https://stackoverflow.com/questions/26764187/counter-ngram-with-tm-package-in-r