Lemmatizer in R or python (am, are, is -> be?) [closed]

Submitted by ぃ、小莉子 on 2019-11-30 10:32:24

Here is a way to do it in R, using MorphAdorner, the Northwestern University lemmatizer. The function below queries its web service once per word, so it needs network access.

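library(httr)   # GET(), content()
library(XML)    # xmlInternalTreeParse(), xmlValue()

lemmatize <- function(wordlist) {
  url <- "http://devadorner.northwestern.edu/maserver/lemmatizer"
  # Query the MorphAdorner service for a single word and return its lemma.
  get.lemma <- function(word, url) {
    response <- GET(url, query = list(spelling = word, standardize = "",
                                      wordClass = "", wordClass2 = "",
                                      corpusConfig = "ncf",    # Nineteenth Century Fiction
                                      media = "xml"))
    body <- content(response, as = "text")   # response body as a character string
    xml  <- xmlInternalTreeParse(body)
    return(xmlValue(xml["//lemma"][[1]]))    # text of the first <lemma> node
  }
  # sapply() keeps the input words as the names of the result
  return(sapply(wordlist, get.lemma, url = url))
}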

words <- c("is","am","was","are")
lemmatize(words)
#   is   am  was  are 
# "be" "be" "be" "be" 

As I suspect you are aware, correct lemmatization requires knowledge of the word class (part of speech) and the contextually correct spelling, and it also depends on which corpus is being used.
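
Since the question also mentions Python, here is a minimal sketch of why the word class matters. It does not call MorphAdorner or any real lemmatizer: the LEMMAS table and the toy_lemmatize function are hypothetical, hand-written for illustration only. The point is just that the same spelling can map to different lemmas depending on the part of speech.

```python
# Toy lookup table: (surface form, part of speech) -> lemma.
# These entries are hand-written for illustration; they are an assumption,
# not data taken from MorphAdorner or any other lemmatizer.
LEMMAS = {
    ("is", "verb"): "be",
    ("am", "verb"): "be",
    ("are", "verb"): "be",
    ("was", "verb"): "be",
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
    ("leaves", "verb"): "leave",
    ("leaves", "noun"): "leaf",
}


def toy_lemmatize(word, pos):
    """Return the lemma for (word, pos); fall back to the lowercased word."""
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


if __name__ == "__main__":
    examples = [("saw", "verb"), ("saw", "noun"), ("leaves", "noun"), ("Are", "verb")]
    for word, pos in examples:
        print(word, pos, "->", toy_lemmatize(word, pos))
```

Without the part of speech there is no way to choose between "see" and "saw" for the input "saw", which is why the service above accepts a wordClass parameter (left empty in the R code).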
