lemmatization

Lemmatizing words after POS tagging produces unexpected results

Submitted by 老子叫甜甜 on 2020-01-05 03:54:05

Question: I am using Python 3.5 with the NLTK pos_tag function and the WordNetLemmatizer. My goal is to flatten words in our database to classify text. While testing the lemmatizer, I encounter strange behavior when using the POS tagger on identical tokens. In the example below, I have a list of three strings, and when I run them through the POS tagger every other element is returned as a noun (NN) and the rest are returned as verbs (VBG). This affects the lemmatization. The output looks like …
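The alternation described above is plausible because NLTK's default tagger (an averaged perceptron) conditions on surrounding context, including the previously predicted tag, so identical tokens need not receive identical tags. The following is a toy illustration of that mechanism only, not NLTK's actual model; the tagging rule inside is invented for demonstration:

```python
def toy_tag(tokens):
    """Toy context-sensitive tagger: like NLTK's perceptron tagger,
    it conditions on the previous predicted tag, not just the word,
    so a repeated token can come out with alternating tags."""
    tags = []
    prev = '<s>'  # sentence-start marker
    for tok in tokens:
        # Hypothetical rule: a gerund-looking word reads as a verb
        # after a sentence start or a noun, and as a noun after a verb.
        if tok.endswith('ing'):
            tag = 'VBG' if prev in ('<s>', 'NN') else 'NN'
        else:
            tag = 'NN'
        tags.append((tok, tag))
        prev = tag
    return tags
```

Running `toy_tag(['testing', 'testing', 'testing'])` alternates VBG/NN/VBG, mirroring the every-other-element pattern in the question: each prediction feeds into the features for the next one.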

lemmatize plural nouns using nltk and wordnet

Submitted by 喜你入骨 on 2020-01-04 02:04:17

Question: I want to lemmatize using:

    from nltk import word_tokenize, sent_tokenize, pos_tag
    from nltk.stem.wordnet import WordNetLemmatizer
    from nltk.corpus import wordnet

    lmtzr = WordNetLemmatizer()
    POS = pos_tag(text)

    def get_wordnet_pos(treebank_tag):
        # maps the POS tag so the lemmatizer understands it
        if treebank_tag.startswith('J'):
            return wordnet.ADJ
        elif treebank_tag.startswith('V'):
            return wordnet.VERB
        elif treebank_tag.startswith('N'):
            return wordnet.NOUN
        elif treebank_tag …
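The excerpt cuts off mid-function; the conventional way to finish this mapping is to handle adverbs and fall back to noun (which is also `WordNetLemmatizer`'s own default). A completed sketch, with plain single-letter strings standing in for `wordnet.ADJ`/`VERB`/`NOUN`/`ADV` (which are those same letters), and the NLTK call kept inside an uncalled helper since it needs the downloaded WordNet corpus:

```python
def treebank_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter.

    Return values match nltk's wordnet.ADJ/VERB/NOUN/ADV constants,
    which are just these single-letter strings.
    """
    if tag.startswith('J'):
        return 'a'   # adjective
    if tag.startswith('V'):
        return 'v'   # verb
    if tag.startswith('R'):
        return 'r'   # adverb
    return 'n'       # default: noun, matching the lemmatizer's default


def lemmatize_tagged(tagged_tokens):
    """Lemmatize (word, treebank_tag) pairs with NLTK.

    Not executed here: requires `pip install nltk` plus the corpus
    downloaded via nltk.download('wordnet').
    """
    from nltk.stem.wordnet import WordNetLemmatizer
    lmtzr = WordNetLemmatizer()
    return [lmtzr.lemmatize(word, treebank_to_wordnet(tag))
            for word, tag in tagged_tokens]
```

With this in place, `lemmatize_tagged(pos_tag(word_tokenize(text)))` lemmatizes each token under its tagged part of speech instead of the noun default.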

NLTK words lemmatizing

Submitted by 强颜欢笑 on 2020-01-03 17:23:32

Question: I am trying to do lemmatization on words with NLTK. What I have found so far is that I can use the stem package to get results like transforming "cars" to "car" and "women" to "woman"; however, I cannot lemmatize words with derivational affixes like "acknowledgement". Using WordNetLemmatizer() on "acknowledgement" returns "acknowledgement", and PorterStemmer() returns "acknowledg" rather than "acknowledge". Can anyone tell me how to eliminate the affixes of words? Say, when …
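The underlying issue is that "acknowledgement" → "acknowledge" is a *derivational* change (it crosses lexemes), while the WordNet lemmatizer only undoes *inflection*. One crude workaround is a hand-built suffix table validated against a known vocabulary. The suffix rules below are illustrative assumptions, not a standard list:

```python
# Hypothetical derivational-suffix rules: (suffix, replacement).
# Order matters: try the more specific 'ement' before 'ment'.
DERIVATIONAL = [('ement', 'e'), ('ment', ''), ('ation', 'ate'), ('ness', '')]

def strip_affix(word, vocabulary):
    """Crude derivational stripper: try each suffix rule and keep the
    first candidate that is a real word in `vocabulary`. This is what
    WordNetLemmatizer deliberately does not do, since derivation
    changes the lexeme rather than the inflection."""
    for suffix, replacement in DERIVATIONAL:
        if word.endswith(suffix):
            candidate = word[: -len(suffix)] + replacement
            if candidate in vocabulary:
                return candidate
    return word
```

For example, `strip_affix('acknowledgement', vocab)` yields `'acknowledge'` when that word is in the vocabulary; the vocabulary check is what keeps the heuristic from mangling words like "moment".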

How to lemmatize a list of sentences

Submitted by [亡魂溺海] on 2020-01-03 02:40:07

Question: How can I lemmatize a list of sentences in Python?

    from nltk.stem.wordnet import WordNetLemmatizer
    a = ['i like cars', 'cats are the best']
    lmtzr = WordNetLemmatizer()
    lemmatized = [lmtzr.lemmatize(word) for word in a]
    print(lemmatized)

This is what I've tried, but it gives me back the same sentences. Do I need to tokenize the words first for it to work properly?

Answer 1: TL;DR: pip3 install -U pywsd, then:

    >>> from pywsd.utils import lemmatize_sentence
    >>> text = 'i like cars'
    >>> lemmatize_sentence(text)
    […
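The key point is that `lemmatize` operates on a single word, so the code above is passing whole sentences as if they were words. Each sentence must be tokenized first, lemmatized token by token, and rejoined. A minimal shape-preserving sketch, with a toy lemma table standing in for the WordNet lookup (real code would use `word_tokenize` and `WordNetLemmatizer`, or pywsd's `lemmatize_sentence`, as in the answer):

```python
# Toy lemma table standing in for WordNet lookups (illustrative only).
LEMMAS = {'cars': 'car', 'cats': 'cat', 'are': 'be'}

def lemmatize_sentences(sentences):
    """Tokenize each sentence, lemmatize per token, and rejoin, so a
    list of sentences maps to an output list of the same shape."""
    out = []
    for sent in sentences:
        tokens = sent.split()  # stand-in for nltk.word_tokenize
        out.append(' '.join(LEMMAS.get(t, t) for t in tokens))
    return out
```

Applied to the question's input, this returns one lemmatized string per input sentence instead of echoing the sentences unchanged.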

What is the true difference between lemmatization and stemming?

Submitted by ≡放荡痞女 on 2019-12-28 07:36:31

Question: When do I use each? Also, is NLTK's lemmatization dependent on parts of speech? Wouldn't it be more accurate if it were?

Answer 1: Short and dense: http://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. However, the two words differ in their flavor. Stemming usually refers to a crude heuristic process that chops off …
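The contrast can be made concrete with two deliberately tiny functions: the stemmer just chops suffixes and may produce non-words, while the lemmatizer consults a dictionary and so handles irregular forms. Both rule sets below are simplifications for illustration, not Porter's or WordNet's actual rules:

```python
def crude_stem(word):
    """Stemming: heuristic suffix chopping; the result need not be a
    real word (Porter famously turns 'ponies' into 'poni')."""
    if word.endswith('sses'):
        return word[:-2]          # 'caresses' -> 'caress'
    if word.endswith('ies'):
        return word[:-3] + 'i'    # 'ponies'   -> 'poni'
    if word.endswith('s'):
        return word[:-1]          # 'cats'     -> 'cat'
    return word

# Toy dictionary of irregular forms, standing in for WordNet.
IRREGULAR = {'women': 'woman', 'ran': 'run'}

def crude_lemma(word):
    """Lemmatization: map to a dictionary headword, so irregular
    forms come out right ('women' -> 'woman', not 'women')."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith('ies'):
        return word[:-3] + 'y'    # 'ponies' -> 'pony', a real word
    if word.endswith('s'):
        return word[:-1]
    return word
```

The practical trade-off follows directly: stemming is fast and dictionary-free but lossy; lemmatization needs a lexicon (and, for accuracy, a POS tag) but returns valid citation forms.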

Python Lemmatizing input list, return output list

Submitted by 依然范特西╮ on 2019-12-24 09:37:15

Question: I have a list containing strings that I am lemmatizing. Though I can lemmatize all the strings, I am having a hard time returning the lemmatized strings in the same list format that I passed to the lemmatizer. Checking the type of each output, I get unicode and str objects. I tried converting the unicode objects to strings and concatenating the strings into a list, but with no luck. Below is the reproducible code:

    typea = ['colors', 'caresses', 'ponies', 'presumably', 'owed', 'says']
    for i …
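The usual fix is to collect the results inside a list comprehension rather than printing them in the loop, coercing each result with `str()` to normalize the unicode/str mix (a Python 2 concern; on Python 3 everything is already `str`). A sketch with a stand-in lemmatizer, since the real one needs the WordNet corpus:

```python
def toy_lemmatize(word):
    """Stand-in for WordNetLemmatizer().lemmatize(word), which on
    Python 2 can return unicode objects."""
    return word[:-1] if word.endswith('s') else word

typea = ['colors', 'caresses', 'ponies', 'presumably', 'owed', 'says']

# Build the output list in one pass; str() flattens any unicode/str
# mix so the result has a uniform element type.
typeb = [str(toy_lemmatize(w)) for w in typea]
```

This keeps input and output aligned one-to-one: `typeb[i]` is always the lemmatized form of `typea[i]`.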

How to use lemmatisation (LemmaGen) in C++

Submitted by 别说谁变了你拦得住时间么 on 2019-12-23 02:19:21

Question: I'm using LemmaGen (http://lemmatise.ijs.si) for text lemmatisation. I've successfully used it by running the following statement on the command line:

    $ lemmatize -l ./data/lemmatizer/lem-m-en.bin input.txt output.txt

However, I actually want to use it as a library in my C++ project programmatically. Does anyone know how to use the LemmaGen C++ API? Thanks! Or can anyone suggest another C++ lemmatisation library that can be used programmatically? Please correct me if I'm asking the question …

Is it possible to speed up Wordnet Lemmatizer?

Submitted by 旧时模样 on 2019-12-20 12:15:14

Question: I'm using the WordNet Lemmatizer via NLTK on the Brown Corpus (to determine whether the nouns in it are used more in their singular or their plural form), i.e.:

    from nltk.stem.wordnet import WordNetLemmatizer
    l = WordNetLemmatizer()

I've noticed that even the simplest queries, such as the one below, take quite a long time (at least a second or two):

    l.lemmatize("cats")

Presumably this is because a connection must be made to WordNet for each query? I'm wondering if there is a way to still use the …
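For what it's worth, no web connection is involved: NLTK loads the WordNet corpus lazily from local disk, so the first `lemmatize` call pays the load cost and subsequent calls are much faster. Beyond warming it up once, a corpus like Brown repeats words constantly, so memoizing the lookup helps a lot. A sketch with a counting stand-in for the real lemmatizer, showing that `functools.lru_cache` collapses repeated lookups:

```python
from functools import lru_cache

calls = {'n': 0}  # counts how often the underlying lookup runs

def raw_lemmatize(word):
    """Stand-in for WordNetLemmatizer().lemmatize; the real one does
    a corpus search on every call."""
    calls['n'] += 1
    return word[:-1] if word.endswith('s') else word

@lru_cache(maxsize=None)
def lemmatize(word):
    # Memoized front end: repeated words become dict hits.
    return raw_lemmatize(word)
```

Processing `['cats', 'cats', 'dogs', 'cats']` through `lemmatize` hits `raw_lemmatize` only twice, once per distinct word.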

Looking for a database or text file of english words with their different forms

Submitted by 生来就可爱ヽ(ⅴ<●) on 2019-12-19 19:46:57

Question: I am working on a project and I need to get the root of a given word (stemming). As you know, stemming algorithms that don't use a dictionary are not accurate. I also tried WordNet, but it is not good for my project. I found the phpmorphy project, but it doesn't include a Java API. At this point I am looking for a database or a text file of English words with their different forms, for example:

    run running ran ...
    include including included ...
    ...

Thank you for your help or advice.

Answer 1: …
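Whatever word-list is used, consuming it is straightforward: invert the file into a form-to-base lookup table. A sketch assuming the hypothetical line format shown in the question (base form first, other forms after, whitespace-separated); the sample data here is made up:

```python
from io import StringIO

# Hypothetical file contents in the question's format.
sample = StringIO("run running ran\ninclude including included\n")

def load_forms(fh):
    """Build a form -> base-form dictionary from a file-like object
    whose lines each start with the base form."""
    lookup = {}
    for line in fh:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        base = parts[0]
        for form in parts:          # the base maps to itself too
            lookup[form] = base
    return lookup
```

After loading, stemming is a single dictionary lookup (`lookup.get(word, word)`), which is exactly the accuracy advantage of a word-list over rule-based stemmers.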

How to turn plural words singular?

Submitted by 半世苍凉 on 2019-12-18 10:53:25

Question: I'm preparing some table names for an ORM, and I want to turn plural table names into singular entity names. My only problem is finding an algorithm that does this reliably. Here's what I'm doing right now:

- If a word ends with -ies, I replace the ending with -y
- If a word ends with -es, I remove this ending. This doesn't always work, however - for example, it replaces Types with Typ
- Otherwise, I just remove the trailing -s

Does anyone know of a better algorithm?

Answer 1: Those are all general rules …
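English pluralization is rules plus exceptions, so the practical fix is an exception table consulted before the suffix rules, with the -es rule narrowed to the endings that actually take -es (which repairs the "Types" → "Typ" bug). A sketch; the exception entries are examples to extend, and production inflectors (e.g. Rails' ActiveSupport) ship far larger tables:

```python
# Words the suffix rules would get wrong; extend as needed.
IRREGULAR = {'people': 'person', 'statuses': 'status', 'indices': 'index'}

def singularize(word):
    """Heuristic singularizer for ORM table names: exception table
    first, then the question's three rules with a narrowed -es case."""
    w = word.lower()
    if w in IRREGULAR:
        return IRREGULAR[w]
    if w.endswith('ies'):
        return w[:-3] + 'y'                      # categories -> category
    if w.endswith(('ches', 'shes', 'xes', 'sses')):
        return w[:-2]                            # boxes -> box
    if w.endswith('s') and not w.endswith('ss'):
        return w[:-1]                            # types -> type
    return w
```

Restricting the -es rule to ch/sh/x/ss stems means "types" falls through to the plain -s rule and comes out as "type", while genuinely -es-pluralized words like "boxes" are still handled.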