Question
I'm trying to lemmatize a corpus of documents in R with the wordnet library. This is the code:
library(tm)
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)
library(wordnet)
lapply(corpus.documents, function(x) {
  x.filter <- getTermFilter("ContainsFilter", x, TRUE)
  terms <- getIndexTerms("NOUN", 1, x.filter)
  sapply(terms, getLemma)
})
But when running this, I get this error:
Errore in .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word, :
java.lang.NoSuchMethodError: <init>
and this is the stack trace:
5 stop(structure(list(message = "java.lang.NoSuchMethodError: <init>",
call = .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."),
word, ignoreCase), jobj = <S4 object of class structure("jobjRef", package
="rJava")>), .Names = c("message",
"call", "jobj"), class = c("NoSuchMethodError", "IncompatibleClassChangeError", ...
4 .jnew(paste("com.nexagis.jawbone.filter", type, sep = "."), word,
ignoreCase)
3 getTermFilter("ContainsFilter", x, TRUE)
2 FUN(X[[1L]], ...)
1 lapply(corpus.documents, function(x) {
x.filter <- getTermFilter("ContainsFilter", x, TRUE)
terms <- getIndexTerms("NOUN", 1, x.filter)
sapply(terms, getLemma) ...
What's wrong?
Answer 1:
So this does not address your use of wordnet, but it does provide an option for lemmatizing that might work for you (and is better, IMO...). This uses the MorphAdorner API developed at Northwestern University. You can find detailed documentation here. In the code below I'm using their Adorner for Plain Text API.
# MorphAdorner (Northwestern University) web service
adorn <- function(text) {
  require(httr)
  require(XML)
  url <- "http://devadorner.northwestern.edu/maserver/partofspeechtagger"
  response <- GET(url, query = list(text = text, media = "xml",
                                    xmlOutputType = "outputPlainXML",
                                    corpusConfig = "ncf",   # Nineteenth Century Fiction
                                    includeInputText = "false", outputReg = "true"))
  doc   <- content(response, type = "text/xml")
  words <- doc["//adornedWord"]
  xmlToDataFrame(doc, nodes = words)
}
library(tm)
vector.documents <- c("Here is some text.",
                      "This might possibly be some additional text, but then again, maybe not...",
                      "This is an abstruse grammatical construction having as it's sole intention the demonstration of MorphAdorner's capability.")
corpus.documents <- Corpus(VectorSource(vector.documents))
lapply(corpus.documents, function(x) adorn(as.character(x)))
# [[1]]
# token spelling standardSpelling lemmata partsOfSpeech
# 1 Here Here Here here av
# 2 is is is be vbz
# 3 some some some some d
# 4 text text text text n1
# 5 . . . . .
# ...
I'm just showing the lemmatization of the first "document". partsOfSpeech follows the NUPOS convention.
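If you only need one lemmatized string per document, a minimal sketch (assuming each data frame returned by adorn() has the lemmata column shown in the output above) would be:
adorned <- lapply(corpus.documents, function(x) adorn(as.character(x)))
# Collapse the "lemmata" column back into a single string per document.
lemmatized <- sapply(adorned, function(df) paste(df$lemmata, collapse = " "))
lemmatized[[1]]
# "here be some text ."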
Answer 2:
This answers your question, but does not really solve your problem. There is another answer above that attempts to provide a workable solution.
There are several issues with the way you are using the wordnet package, described below, but the bottom line is that even after addressing these, I could not get wordnet to produce anything but gibberish.
First: You can't just install the wordnet package in R; you have to install WordNet on your computer, or at least download the dictionaries. Then, before you use the package, you need to run initDict("path to wordnet dictionaries").
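For reference, a minimal setup sketch; the dictionary path below is only an example and will differ on your machine (setDict() and the WNHOME environment variable are alternative ways the wordnet package offers to locate the dictionaries):
library(wordnet)
# Example path only - point this at your local WordNet "dict" directory.
if (!initDict("C:/Program Files (x86)/WordNet/2.1/dict")) {
  stop("WordNet dictionaries not found - check the path")
}
# Alternatives:
# Sys.setenv(WNHOME = "C:/Program Files (x86)/WordNet/2.1")  # set before loading wordnet
# setDict("C:/Program Files (x86)/WordNet/2.1/dict")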
Second: It looks like getTermFilter(...) expects a character argument for x. The way you have it set up, you are passing an object of type PlainTextDocument. So you need to use as.character(x) to convert it to its contained text, or you get the java error in your question.
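A quick illustration of the difference (assuming the corpus from your question and an initialized dictionary):
doc <- corpus.documents[[1]]
class(doc)                                    # "PlainTextDocument" "TextDocument"
# getTermFilter("ContainsFilter", doc, TRUE)  # fails with java.lang.NoSuchMethodError
f <- getTermFilter("ContainsFilter", as.character(doc), TRUE)  # character argument works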
Third: It looks like getTermFilter(...) expects single words (or phrases). For instance, if you pass "This is a phrase" to getTermFilter(...), it will look up "This is a phrase" in the dictionary. It will not find it, of course, so getIndexTerms(...) returns NULL and getLemma(...) fails... So you have to parse the text of your PlainTextDocument into words first.
Finally, I'm not sure it's a good idea to remove punctuation. For instance, "it's" will be converted to "its", but these are different words with different meanings, and they lemmatize differently.
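A quick check of what removePunctuation does to a contraction:
library(tm)
removePunctuation("it's")
# [1] "its"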
Rolling all this up:
library(tm)
vector.documents <- c("This is a line of text.", "This is another one.")
corpus.documents <- Corpus(VectorSource(vector.documents))
corpus.documents <- tm_map(corpus.documents, removePunctuation)

library(wordnet)
initDict("C:/Program Files (x86)/WordNet/2.1/dict")
lapply(corpus.documents, function(x) {
  # split each document into words, then look up each word separately
  sapply(unlist(strsplit(as.character(x), "[[:space:]]+")), function(word) {
    x.filter <- getTermFilter("StartsWithFilter", word, TRUE)
    terms    <- getIndexTerms("NOUN", 1, x.filter)
    if (!is.null(terms)) sapply(terms, getLemma)
  })
})
# [[1]]
# This is a line of text
# "thistle" "isaac" "a" "line" "off-axis reflector" "text"
As you can see, the output is still gibberish. "This" is lemmatized as "thistle" and so on. It may be that I have the dictionaries configured improperly, so you might have better luck. If you are committed to wordnet for some reason, I suggest you contact the package authors.
Source: https://stackoverflow.com/questions/26196036/r-error-in-lemmatizzation-a-corpus-of-document-with-wordnet