R and tm package: create a term-document matrix with a dictionary of one or two words?

Posted by 谁说我不能喝 on 2019-12-09 07:01:59

Question


Purpose: I want to create a term-document matrix using a dictionary which has compound words, or bigrams, as some of the keywords.

Web Search: Being new to text-mining and the tm package in R, I went to the web to figure out how to do this. Below are some relevant links that I found:

  • FAQS on the tm-package website
  • finding 2 & 3 word phrases using r tm package
  • counter ngram with tm package in r
  • findassocs for multiple terms in r

Background: Of these, I preferred the solution that uses NGramTokenizer in the RWeka package in R, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words. Doc 3 only contains one word. My dictionary keywords are two bigrams and a unigram.

Problem: The NGramTokenizer solution from the links above does not correctly count the unigram keyword in Doc 3.

library(tm)
library(RWeka)

my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))

inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=BigramTokenizer,
                                                  dictionary=my.dict)))

# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs  jedi  jedi grandmaster  jedi master
#    1     1                 0            1
#    2     1                 1            0
#    3     0                 0            0

I was expecting the row for Doc 3 to give 1 for jedi and 0 for the other two. Is there something I am misunderstanding?


Answer 1:


I ran into the same problem and found that the token-counting functions in the tm package rely on an option called wordLengths, a vector of two numbers: the minimum and the maximum length of the tokens to keep. By default, tm uses a minimum word length of 3 characters (wordLengths = c(3, Inf)). You can override this option by adding it to the control list in a call to DocumentTermMatrix, like this:

DocumentTermMatrix(my.corpus,
                   control=list(
                       tokenize=newBigramTokenizer,
                       wordLengths = c(1, Inf)))

That said, your keyword 'jedi' is longer than 3 characters, so the default setting alone should not filter it out. You may have changed the option's value earlier while working out how to count n-grams, though, so it is still worth trying. Also look at the bounds option, which tells tm to discard terms that are less or more frequent than the specified values.
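To make the effect of a length filter concrete, here is a minimal base-R sketch. The helper filter_by_length is hypothetical and only imitates a wordLengths-style filter; it is not tm's own code. It suggests why a four-character token such as 'jedi' would pass the default bounds, whereas one- and two-character tokens would not.

```r
# Hypothetical helper that imitates a wordLengths-style filter.
# It is an illustration only, not the implementation used inside tm.
filter_by_length <- function(tokens, word_lengths = c(3, Inf)) {
  n_chars <- nchar(tokens, type = "chars")
  # Keep tokens whose character count lies within [min, max].
  tokens[n_chars >= word_lengths[1] & n_chars <= word_lengths[2]]
}

tokens <- c("a", "of", "jedi", "jedi master")

filter_by_length(tokens)             # default-like bounds c(3, Inf): drops "a" and "of"
filter_by_length(tokens, c(1, Inf))  # minimum length 1: keeps every token
```

Under this reading, a length filter cannot by itself explain a missing 'jedi' count unless the minimum was raised above 4.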




Answer 2:


I noticed that NGramTokenizer returns character(0) when a one-word string is submitted as input and NGramTokenizer is asked to return unigrams and bigrams.

NGramTokenizer('jedi',  Weka_control(min = 1, max = 2))
# character(0)

I am not sure why this is the output, but I believe this behavior is the reason why the keyword jedi was not counted in Doc 3. However, a simple if-then-else solution appears to work for my situation: both for the sample set and my actual data set.

library(tm)
library(RWeka)    

my.docs = c('jedi master', 'jedi grandmaster', 'jedi')
my.corpus = Corpus(VectorSource(my.docs))
my.dict = c('jedi master', 'jedi grandmaster', 'jedi')

newBigramTokenizer = function(x) {
  tokenizer1 = NGramTokenizer(x, Weka_control(min = 1, max = 2))
  # Fall back to WordTokenizer (another tokenizer in the RWeka package)
  # when NGramTokenizer returns character(0), e.g. for one-word input.
  if (length(tokenizer1) != 0L) {
    return(tokenizer1)
  } else {
    return(WordTokenizer(x))
  }
}

inspect(DocumentTermMatrix(my.corpus, control=list(tokenize=newBigramTokenizer,
                                                 dictionary=my.dict)))

# <<DocumentTermMatrix (documents: 3, terms: 3)>>
# ...
# Docs jedi jedi grandmaster jedi master
#   1    1                0           1
#   2    1                1           0
#   3    1                0           0

Please let me know if anyone finds a "gotcha" that I am not considering in the code above. I would also appreciate any insight into why NGramTokenizer returns character(0) in my observation above.
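As a possible alternative to the fallback above, the unigrams and bigrams can also be built in base R, which avoids the one-word special case altogether. This is only a sketch: the function name unigram_bigram_tokens is hypothetical, it splits on whitespace only (no punctuation handling), and its token order differs from NGramTokenizer's.

```r
# Hypothetical base-R tokenizer (not part of tm or RWeka): returns all
# unigrams followed by all bigrams of the input text.
unigram_bigram_tokens <- function(x) {
  # Collapse multi-line input into one string, then split on whitespace.
  text  <- trimws(paste(as.character(x), collapse = " "))
  words <- strsplit(text, "[[:space:]]+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)
  # With fewer than two words there are no bigrams; return the unigrams as-is.
  if (n < 2L) return(words)
  bigrams <- paste(words[-n], words[-1L])
  c(words, bigrams)
}

unigram_bigram_tokens("jedi")         # one-word input still yields "jedi"
unigram_bigram_tokens("jedi master")  # "jedi", "master", "jedi master"
```

I have not tested it inside tm, but a function like this could presumably be passed as the tokenize entry of the control list in place of newBigramTokenizer, assuming the corpus type in use honors custom tokenizers.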



Source: https://stackoverflow.com/questions/28033034/r-and-tm-package-create-a-term-document-matrix-with-a-dictionary-of-one-or-two
