rweka

Document-term matrix in R - bigram tokenizer not working

混江龙づ霸主 submitted on 2019-11-28 08:21:30
Question

I am trying to build two document-term matrices for a corpus, one with unigrams and one with bigrams. However, the bigram matrix currently comes out identical to the unigram matrix, and I'm not sure why.

The code:

    docs <- Corpus(DirSource("data", recursive=TRUE))

    # Get the document term matrices
    BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
    dtm_unigram <- DocumentTermMatrix(docs, control = list(tokenize="words", removePunctuation = TRUE, stopwords = stopwords(