R tm stemCompletion generates NA value

Question

When I try to apply stemCompletion to a corpus, the function generates NA values.

This is my code:

my.corpus <- tm_map(my.corpus, removePunctuation) 
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))

(One result of this is: [[2584]] zoning plan )

The next step is stemming the corpus:

my.corpus <- tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")

But the result is:

[[2584]] NA plant

The next step should be to build an incidence matrix of transactions and then mine apriori rules, but when I go on and try to get the rules, inspect(rules) gives me this error:

> inspect(rules)
Errore in UseMethod("inspect", x) : 
no applicable method for 'inspect' applied to an object of class "c('rules','associations')"

What is the problem? I suspect the NA values prevent the incidence matrix, and therefore the rules, from being built correctly. Is that the case? If so, how can I solve it?

Here is a minimal example that reproduces the problem:

my.words = c("β cell","zoning policy regional index brazil","zoning plan","zolpidem  adult","zizyphus spinosa hu")
my.corpus = Corpus(VectorSource(my.words))
my.corpus_copy = my.corpus
my.corpus = tm_map(my.corpus, removePunctuation)
my.corpus = tm_map(my.corpus, removeWords, c("the", stopwords("english"))) 
my.corpus = tm_map(my.corpus, stemDocument, language="english")
my.corpus <- tm_map(my.corpus, stemCompletion, dictionary=my.corpus_copy, type="first")
inspect(my.corpus)

Answer 1:

At the moment, stemCompletion() is only an approximate reversal of the stemming process when the original corpus is used as the dictionary parameter. Using grep(), it searches the dictionary for all words that start with the current stemmed word (the pattern is anchored with ^) and then picks one of them for completion according to 'type'.

It therefore fails whenever stemming produces a word that is not a prefix of the un-stemmed word. For example, the stems of c('delivery', 'zoning') are c('deliveri', 'zone'), as returned by wordStem(), which stemDocument() uses. In both cases the stem is not even a substring of the original word, so stemCompletion() finds no replacement and returns NA.
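To see why the lookup comes back empty, the prefix search can be reproduced in base R without loading tm. This is only a sketch: the stems are hard-coded from the example above rather than computed with wordStem(), and `dictionary`, `stems`, `possible` and `first_completion` are illustrative names.

```r
# Un-stemmed words that would serve as the completion dictionary.
dictionary <- c("delivery", "zoning", "plan", "plant")

# Stems hard-coded from the text above (assumed, not computed here).
stems <- c("deliveri", "zone", "plan")

# The same anchored prefix search, applied to each stem in turn.
possible <- lapply(stems, function(w) {
  grep(sprintf("^%s", w), dictionary, value = TRUE)
})
names(possible) <- stems

# Taking the first candidate mimics type = "first";
# indexing an empty result with [1] yields NA.
first_completion <- sapply(possible, "[", 1)
print(first_completion)
```

Only "plan" finds candidates here; "deliveri" and "zone" match nothing, which is where the NA entries come from.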

There are several ways to work around this, including replacing the NAs with the stemmed words after stemCompletion() returns or, better, modifying stemCompletion() itself. A simple modification that keeps the stemmed word instead of NA is to define your own version, stemCompletion_modified() (replace ... with the original code from the stemCompletion() function in the tm package):

stemCompletion_modified <- function (x, dictionary, type = ...)
{
  ...
  # Original line in tm:
  # possibleCompletions <- lapply(x, function(w) grep(sprintf("^%s", w), dictionary, value = TRUE))
  # Modified: fall back to the stem itself when no dictionary word starts with it.
  # (if/else is used instead of ifelse(), which would keep only the first match.)
  possibleCompletions <- lapply(x, function(w) {
    matches <- grep(sprintf("^%s", w), dictionary, value = TRUE)
    if (length(matches) == 0L) w else matches
  })
  ...
}
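
The other alternative mentioned above, filling the NAs back in after stemCompletion() returns, might look like the following sketch. It assumes the stems are available as a character vector and simulates the completed vector instead of calling tm, so `fill_na_with_stem`, `stems` and `completed` are hypothetical names.

```r
# Hypothetical helper: keep the completion where one was found,
# otherwise fall back to the stem itself.
fill_na_with_stem <- function(stems, completed) {
  stopifnot(length(stems) == length(completed))
  result <- ifelse(is.na(completed), stems, completed)
  unname(result)
}

stems <- c("zone", "plan")
# Simulated completion result for these stems (NA = no match found).
completed <- c(zone = NA, plan = "plant")

print(fill_na_with_stem(stems, completed))
```

With tm loaded, `completed` would instead come from a call such as stemCompletion(stems, dictionary = my.corpus_copy, type = "first").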

Source: https://stackoverflow.com/questions/18782455/r-tm-stemcompletion-generates-na-value

Tags

stemming