lemmatization

How to use lemmatisation (LemmaGen) in C++

匆匆过客 submitted on 2019-12-08 03:31:27
I'm using LemmaGen (http://lemmatise.ijs.si) for text lemmatisation. I've successfully used it by running the following statement on the command line:

    lemmatize -l ./data/lemmatizer/lem-m-en.bin input.txt output.txt

However, I actually want to use it as a library in my C++ project, programmatically. Does anyone know how to use the LemmaGen C++ API? Thanks! Or can anyone suggest another lemmatisation library that can be called from C++ programmatically? Please correct me if I'm asking the question wrongly, as I'm still quite new to C++.

Source: https://stackoverflow.com/questions/37151476/how-to-use
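No answer is quoted in this excerpt, so here is a stopgap rather than the LemmaGen C++ API itself: the CLI shown above can be driven from code in any host language. A minimal Python sketch, with the binary name and dictionary path assumed to match the command in the question:

    import subprocess

    # Run the LemmaGen CLI exactly as on the command line; the dictionary
    # path and file names are assumptions carried over from the question.
    subprocess.run(
        ["lemmatize", "-l", "./data/lemmatizer/lem-m-en.bin",
         "input.txt", "output.txt"],
        check=True,
    )

    # The lemmatised text ends up in output.txt.
    with open("output.txt", encoding="utf-8") as f:
        print(f.read())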

WordNetLemmatizer not returning the right lemma unless POS is explicit - Python NLTK

无人久伴 submitted on 2019-12-06 02:37:15
I'm lemmatizing the TED dataset transcript and have noticed something strange: not all words are being lemmatized. For example, selected -> select, which is right. However, involved !-> involve and horsing !-> horse unless I explicitly pass the 'v' (verb) attribute. On the Python terminal I get the right output, but not in my code:

    >>> from nltk.stem import WordNetLemmatizer
    >>> from nltk.corpus import wordnet
    >>> lem = WordNetLemmatizer()
    >>> lem.lemmatize('involved','v')
    u'involve'
    >>> lem.lemmatize('horsing','v')
    u'horse'

The relevant section of the code is this:

    for l in LDA_Row[0].split('+')
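The behaviour described here follows from lemmatize() defaulting to pos='n', so verbs such as 'involved' are looked up as nouns and come back unchanged. The usual fix is to POS-tag each token first and map the Penn Treebank tag to a WordNet constant. A minimal sketch; the to_wordnet_pos helper is illustrative, not part of NLTK:

    from nltk import pos_tag, word_tokenize
    from nltk.corpus import wordnet
    from nltk.stem import WordNetLemmatizer

    def to_wordnet_pos(treebank_tag):
        # Map a Penn Treebank tag (from pos_tag) to a WordNet POS constant,
        # falling back to noun, which is also the lemmatizer's default.
        return {"J": wordnet.ADJ, "V": wordnet.VERB,
                "N": wordnet.NOUN, "R": wordnet.ADV}.get(treebank_tag[:1], wordnet.NOUN)

    lem = WordNetLemmatizer()
    tokens = word_tokenize("He was involved in horsing around")
    print([lem.lemmatize(tok, to_wordnet_pos(tag)) for tok, tag in pos_tag(tokens)])
    # 'involved' -> 'involve' and 'horsing' -> 'horse' once the verb tag is supplied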

Lemmatization of non-English words?

旧街凉风 submitted on 2019-12-04 10:52:31
Question: I would like to apply lemmatization to reduce the inflectional forms of words. I know that for the English language WordNet provides such functionality, but I am also interested in applying lemmatization to Dutch, French, Spanish and Italian words. Is there any trustworthy and proven way to go about this? Thank you!

Answer 1: Try the pattern library from CLIPS; they have support for German, English, Spanish, French and Italian. Just what you needed: http://www.clips.ua.ac.be/pattern Unfortunately it
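Building on that answer, pattern's per-language parsers can emit lemmata alongside POS tags. A minimal sketch for French, assuming the lemmata flag and token slot layout from pattern's documentation (the lemma is appended as the last slot of each token):

    from pattern.fr import parse

    # Parse with lemmata=True so every token carries its lemma.
    tagged = parse("Les chats noirs dormaient", lemmata=True)

    # parse() returns a tagged string; split() yields sentences as lists of
    # token slots, with the lemma in the final position when lemmata=True.
    for sentence in tagged.split():
        for token in sentence:
            print(token[0], "->", token[-1])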

Does the lemmatization mechanism reduce the size of the corpus?

非 Y 不嫁゛ submitted on 2019-12-04 04:09:35
Question: Dear community members, during the pre-processing of the data, after splitting the raw data into tokens, I used the popular WordNet lemmatizer to generate the base forms. I am performing experiments on a dataset that has 18953 tokens. My question is: does the lemmatization process reduce the size of the corpus? I am confused; kindly help in this regard. Any help is appreciated!

Answer 1: Lemmatization converts each token (aka form) in the sentence into its lemma form (aka type):

    >>> from nltk import
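The point the answer is making: lemmatization maps tokens to lemmas one-for-one, so the number of tokens (18953 here) does not change, but distinct inflected forms collapse onto one lemma, so the vocabulary, i.e. the number of unique types, can shrink. A small illustration with a made-up sentence:

    from nltk import word_tokenize
    from nltk.stem import WordNetLemmatizer

    lem = WordNetLemmatizer()
    tokens = word_tokenize("the cats saw the cat and the cats ran")
    lemmas = [lem.lemmatize(t) for t in tokens]

    print(len(tokens), len(lemmas))            # token count is unchanged: 9 9
    print(len(set(tokens)), len(set(lemmas)))  # unique types shrink: 6 5 ('cats' -> 'cat')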

Wordnet Lemmatizer for R

断了今生、忘了曾经 submitted on 2019-12-03 21:38:07
I would like to use the WordNet lemmatizer to lemmatize the words in a:

    > a <- c("He saw a see-saw on a sea shore", "she is feeling cold")
    > a
    [1] "He saw a see-saw on a sea shore" "she is feeling cold"

I convert a into a corpus and do pre-processing steps (like stopword removal, lemmatization, etc.):

    > a <- Corpus(VectorSource(a))

I wanted to do the lemmatization in the following way:

    > filter <- getTermFilter("ExactMatchFilter", a, TRUE)
    > terms <- getIndexTerms("NOUN", 1, filter)
    > sapply(terms, getLemma)

but I get this error:

    > filter <- getTermFilter("ExactMatchFilter", a, TRUE)
    Error in .jnew(paste

Lemmatize French text [closed]

泪湿孤枕 submitted on 2019-12-03 04:52:29
Question: I have some text in French that I need to process in some ways. For that, I need to:

First, tokenize the text into words
Then, lemmatize those words to avoid processing the same root more than once

As far as I can see, the WordNet lemmatizer in the NLTK only works with English. I
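The question is cut off here, but both steps it lists (tokenization, then lemmatization) are available for French through spaCy's French pipeline, as an alternative to the English-only WordNet lemmatizer. A minimal sketch, assuming the fr_core_news_sm model has been installed via python -m spacy download fr_core_news_sm:

    import spacy

    # Load spaCy's small French pipeline; the model name is an assumption,
    # any fr_core_news_* model exposes the same tokenizer and lemmatizer.
    nlp = spacy.load("fr_core_news_sm")

    doc = nlp("Les chats noirs dormaient sur le canapé")
    for token in doc:
        print(token.text, "->", token.lemma_)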

how to use spacy lemmatizer to get a word into basic form

痴心易碎 submitted on 2019-12-03 02:56:52
I am new to spacy and I want to use its lemmatizer function, but I don't know how to use it. I want to pass in a string of words and get back the same string with each word in its basic form. Examples: 'words' => 'word', 'did' => 'do'. Thank you.

damio: The previous answer is convoluted and can't be edited, so here's a more conventional one.

    # make sure you downloaded the English model with "python -m spacy download en"
    import spacy
    nlp = spacy.load('en')
    doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")
    for token in doc:
        print(token, token.lemma, token.lemma_)

Output:

    Apples 6617 apples
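To get a single string back with every word in base form, as the question asks, the lemmas from that loop can simply be joined. A short self-contained sketch reusing the same model and sentence:

    import spacy

    nlp = spacy.load('en')  # same model as in the answer above
    doc = nlp(u"Apples and oranges are similar. Boots and hippos aren't.")

    # Join each token's lemma back into one string; " ".join is a rough
    # re-assembly and does not restore original spacing around punctuation.
    print(" ".join(token.lemma_ for token in doc))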