R tm stemCompletion generates NA value

Submitted by 时间秒杀一切 on 2019-12-11 02:16:14

Question


When I apply stemCompletion to a corpus, the function generates NA values.

This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation) 
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english")) 

(One resulting document is: [[2584]] zoning plan)

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")

But the result is:

[[2584]] NA plant

The next step should be to build an incidence matrix of transactions and then mine apriori rules, but when I continue and try to inspect the rules, inspect(rules) gives me this error:

> inspect(rules)
Errore in UseMethod("inspect", x) : 
no applicable method for 'inspect' applied to an object of class "c('rules','associations')"

What is the problem? I suspect the NA values prevent the incidence matrix, and therefore the rules, from being generated correctly. Is that the cause? If so, how can I solve it?

Here is a minimal reproduction of the problem:

my.words = c("β cell","zoning policy regional index brazil","zoning plan","zolpidem  adult","zizyphus spinosa hu")
my.corpus = Corpus(VectorSource(my.words))
my.corpus_copy = my.corpus
my.corpus = tm_map(my.corpus, removePunctuation)
my.corpus = tm_map(my.corpus, removeWords, c("the", stopwords("english"))) 
my.corpus = tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")
inspect(my.corpus)

Answer 1:


At present, stemCompletion() is only an approximate reversal of the stemming process when the original corpus is used as the dictionary argument. Using grep(), it searches the dictionary for all words that start with the current stemmed word and then picks one of them for completion according to 'type'.

It therefore fails whenever stemming returns a word that is not a prefix of the un-stemmed word. For example, the stems of c('delivery', 'zoning') are c('deliveri', 'zone'), as returned by wordStem(), which stemDocument() uses. In both cases the stemmed word is not a prefix of the un-stemmed word, so stemCompletion() finds no replacement and returns NA.
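To make the failure concrete, here is a minimal base-R sketch of the prefix search. It does not call tm or SnowballC: the stems are hard-coded from the example above, and the small dictionary is an assumption made up for illustration.

```r
# Dictionary of un-stemmed words (an illustrative assumption, not the real corpus).
dictionary <- c("delivery", "zoning", "plan", "planning", "policy")

# Stems as quoted in the answer, hard-coded here instead of calling wordStem().
stems <- c("deliveri", "zone", "plan")

# Prefix search in the style of the grep() call used for completion.
matches <- lapply(stems, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
names(matches) <- stems

# "deliveri" and "zone" match no dictionary word, so there is nothing to complete to.
# "plan" matches both "plan" and "planning".
print(matches)

# Taking the first candidate, as a stand-in for type = "first", yields NA for empty matches.
first_completion <- sapply(matches, "[", 1)
print(first_completion)
```

The empty matches for "deliveri" and "zone" are what surface as NA in the completed corpus.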

There are several ways around this, such as replacing the NAs with the stemmed words after stemCompletion() returns or, better, modifying stemCompletion() itself. A simple modification that keeps the stemmed word instead of returning NA is to define your own version, stemCompletion_modified() (replace each ... with the original code from the stemCompletion() function in the tm package):

stemCompletion_modified <- function (x, dictionary, type = ...) 
{
  ...
  # Original line in tm:
  # possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
  # Modified: fall back to the stem itself when no dictionary word starts with it.
  possibleCompletions <- lapply(x, function(w) {
    matches <- grep(sprintf("^%s", w), dictionary, value = TRUE)
    if (length(matches) == 0) w else matches
  })
  ...
}
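The other option mentioned above, replacing the NAs with the stems after completion, could look roughly like the sketch below. The helper name fill_na_with_stems is hypothetical (it is not part of tm), and the completed vector is a hand-written stand-in for what stemCompletion() might return for one document, so the example runs in base R alone.

```r
# Hypothetical helper (not part of tm): keep the stem wherever completion produced NA.
fill_na_with_stems <- function(completed, stems) {
  stopifnot(length(completed) == length(stems))
  ifelse(is.na(completed), stems, completed)
}

# Stems of the words in one document.
stems <- c("zone", "plan")

# Assumed stand-in for the output of stemCompletion() on those stems:
# NA where no dictionary word starts with the stem.
completed <- c(NA, "plant")

repaired <- fill_na_with_stems(completed, stems)
print(repaired)
```

This approach leaves the tm function untouched, at the cost of requiring the stems and their completions to be kept side by side as word vectors of equal length.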


Source: https://stackoverflow.com/questions/18782455/r-tm-stemcompletion-generates-na-value
