How to perform Lemmatization in R?

假如想象 提交于 2019-12-02 16:21:43

Hello you can try package koRpus which allow to use Treetagger :

tagged.results <- treetag(c("run", "ran", "running"), treetagger="manual", format="obj",
                      TT.tknz=FALSE , lang="en",
                      TT.options=list(path="./TreeTagger", preset="en"))
tagged.results@TT.res

##     token tag lemma lttr wclass                               desc stop stem
## 1     run  NN   run    3   noun             Noun, singular or mass   NA   NA
## 2     ran VVD   run    3   verb                   Verb, past tense   NA   NA
## 3 running VVG   run    7   verb Verb, gerund or present participle   NA   NA

See the lemma column for the result you're asking for.

As a previous post mentioned, the function lemmatize_words() from the R package textstem can perform this and give you what I understand as your desired results:

library(textstem)
vector <- c("run", "ran", "running")
lemmatize_words(vector)

## [1] "run" "run" "run"

@Andy and @Arunkumar are correct when they say textstem library can be used to perform stemming and/or lemmatization. However, lemmatize_words() will only work on a vector of words. But in a corpus, we do not have vector of words; we have strings, with each string being a document's content. Hence, to perform lemmatization on a corpus, you can use function lemmatize_strings() as an argument to tm_map() of tm package.

> corpus[[1]]
[1] " earnest roughshod document serves workable primer regions recent history make 
terrific th-grade learning tool samuel beckett applied iranian voting process bard 
black comedy willie loved another trumpet blast may new mexican cinema -bornin "
> corpus <- tm_map(corpus, lemmatize_strings)
> corpus[[1]]
[1] "earnest roughshod document serve workable primer region recent history make 
terrific th - grade learn tool samuel beckett apply iranian vote process bard black 
comedy willie love another trumpet blast may new mexican cinema - bornin"

Do not forget to run the following line of code after you have done lemmatization:

> corpus <- tm_map(corpus, PlainTextDocument)

This is because in order to create a document-term matrix, you need to have 'PlainTextDocument' type object, which gets changed after you use lemmatize_strings() (to be more specific, the corpus object does not contain content and meta-data of each document anymore - it is now just a structure containing documents' content; this is not the type of object that DocumentTermMatrix() takes as an argument).

Hope this helps!

Maybe stemming is enough for you? Typical natural language processing tasks make do with stemmed texts. You can find several packages from CRAN Task View of NLP: http://cran.r-project.org/web/views/NaturalLanguageProcessing.html

If you really do require something more complex, then there's specialized solutsions based on mapping sentences to neural nets. As far as I know, these require massive amount of training data. There is lots of open software created and made available by Stanford NLP Group.

If you really want to dig into the topic, then you can dig through the event archives linked at the same Stanford NLP Group publications section. There's some books on the topic as well.

Arunkumar CR

Lemmatization can be done in R easily with textStem package.
Steps are:
1) Install textstem
2) Load the package by library(textstem)
3) stem_word=lemmatize_words(word, dictionary = lexicon::hash_lemmas)
where stem_word is the result of lemmatization and word is the input word.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!