NLTK Lemmatizer, Extract meaningful words

Submitted by 帅比萌擦擦* on 2019-12-14 04:04:05

Question


Currently, I am building machine-learning code that automatically maps items to categories.

Before that, I want to do some natural language preprocessing.

I have a list of words:

      sent = ('The laughs you two heard were triggered '
              'by memories of his own high j-flying '
              'moist moisture moisturize moisturizing ').lower().split()

I wrote the following code, based on this question: NLTK: lemmatizer and pos_tag.

from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
def lemmatize_all(sentence):
    wnl = WordNetLemmatizer()
    for word, tag in pos_tag(word_tokenize(sentence)):
        if tag.startswith("NN"):
            yield wnl.lemmatize(word, pos='n')
        elif tag.startswith('VB'):
            yield wnl.lemmatize(word, pos='v')
        elif tag.startswith('JJ'):
            yield wnl.lemmatize(word, pos='a')



words = ' '.join(lemmatize_all(' '.join(sent)))
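(As an aside, the `if`/`elif` chain in this code silently drops every token whose POS tag does not start with NN, VB, or JJ; that is why words like "you" and "two" never reach the output. A small mapping helper would keep every token instead. This is only a sketch; the noun fallback is chosen to match `WordNetLemmatizer`'s own default.)

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag to a WordNet POS letter.

    Tags that are not noun/verb/adjective/adverb fall back to 'n' (noun),
    which is also WordNetLemmatizer's default, so no token is dropped.
    """
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (default)

# Usage with NLTK (requires the punkt and wordnet data):
# for word, tag in pos_tag(word_tokenize(sentence)):
#     yield wnl.lemmatize(word, pos=penn_to_wordnet(tag))
```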

The resulting values are shown below.

laugh heard be trigger memory own high j-flying moist moisture moisturize moisturizing

I am satisfied with the following results.

laughs -> laugh 
were -> be
triggered -> trigger 
memories -> memory 
moist -> moist 

However, I am not satisfied with the following results.

heard -> heard 
j-flying -> j-flying 
moisture -> moisture 
moisturize -> moisturize 
moisturizing -> moisturizing 

Although this is better than the raw input, I would like the following results instead.

heard -> hear
j-flying -> fly
moisture -> moist
moisturize -> moist
moisturizing -> moist

If you have any other good way to extract meaningful words, please let me know. Thank you


Answer 1:


TL;DR

It's an XY problem: the lemmatizer fails to meet your expectations because the lemmatizer you're using was built to solve a different problem.


In Long

Q: What is a lemma?

Lemmatisation (or lemmatization) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form. - Wikipedia

Q: What is the "dictionary form"?

NLTK uses the morphy algorithm, which uses WordNet as the basis for "dictionary forms".

See also How does spacy lemmatizer works?. Note SpaCy has additional hacks put in to handle more irregular words.

Q: Why moisture -> moisture and moisturizing -> moisturizing?

Because there are synsets (sort of "dictionary forms") for "moisture" and "moisturizing":

>>> from nltk.corpus import wordnet as wn

>>> wn.synsets('moisture')
[Synset('moisture.n.01')]
>>> wn.synsets('moisture')[0].definition()
'wetness caused by water'

>>> wn.synsets('moisturizing')
[Synset('humidify.v.01')]
>>> wn.synsets('moisturizing')[0].definition()
'make (more) humid'

Q: How could I get moisture -> moist?

There's no really useful way. But maybe try a stemmer (and don't expect too much of it):

>>> from nltk.stem import PorterStemmer

>>> porter = PorterStemmer()
>>> porter.stem("moisture")
'moistur'

>>> porter.stem("moisturizing")
'moistur'

Q: Then how do I get moisturizing/moisture -> moist?!!

There's no well-founded way to do that. But before even trying, ask what the eventual purpose of mapping moisturizing/moisture -> moist is.

Is it really necessary to do that?

If you really want, you can try word vectors and try to look for most similar words but there's a whole other world of caveats that comes with word vectors.
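To make the word-vector idea concrete, here is a minimal sketch with tiny made-up vectors. The numbers are purely illustrative, not real embeddings; in practice you would load pretrained vectors (word2vec, GloVe, fastText, ...) and face the caveats mentioned above.

```python
import math

# Toy vectors, purely illustrative -- NOT real embeddings.
toy_vectors = {
    "moist":      [0.9, 0.1, 0.0],
    "moisture":   [0.8, 0.2, 0.1],
    "moisturize": [0.6, 0.4, 0.2],
    "trigger":    [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(word, vectors):
    """The word (other than `word` itself) with the highest cosine similarity."""
    return max((w for w in vectors if w != word),
               key=lambda w: cosine(vectors[word], vectors[w]))

print(most_similar("moisture", toy_vectors))  # -> 'moist' (with these toy numbers)
```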

Q: Wait a minute but heard -> heard is ridiculous?!

Yeah, the POS tagger isn't tagging heard correctly, most probably because the input is not a proper sentence, so the POS tags for its words come out wrong:

>>> from nltk import word_tokenize, pos_tag
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'

>>> pos_tag(word_tokenize(sent))
[('The', 'DT'), ('laughs', 'NNS'), ('you', 'PRP'), ('two', 'CD'), ('heard', 'NNS'), ('were', 'VBD'), ('triggered', 'VBN'), ('by', 'IN'), ('memories', 'NNS'), ('of', 'IN'), ('his', 'PRP$'), ('own', 'JJ'), ('high', 'JJ'), ('j-flying', 'NN'), ('moist', 'NN'), ('moisture', 'NN'), ('moisturize', 'VB'), ('moisturizing', 'NN'), ('.', '.')]

We see that heard is tagged as NNS (a noun). If we lemmatized it as a verb:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('heard', pos='v')
'hear'

Q: Then how do I get a correct POS tag?!

Probably with SpaCy, you get ('heard', 'VERB'):

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [(word.text, word.pos_) for word in doc]
[('The', 'DET'), ('laughs', 'VERB'), ('you', 'PRON'), ('two', 'NUM'), ('heard', 'VERB'), ('were', 'VERB'), ('triggered', 'VERB'), ('by', 'ADP'), ('memories', 'NOUN'), ('of', 'ADP'), ('his', 'ADJ'), ('own', 'ADJ'), ('high', 'ADJ'), ('j', 'NOUN'), ('-', 'PUNCT'), ('flying', 'VERB'), ('moist', 'NOUN'), ('moisture', 'NOUN'), ('moisturize', 'NOUN'), ('moisturizing', 'NOUN'), ('.', 'PUNCT')]

But note, in this case, SpaCy got ('moisturize', 'NOUN') and NLTK got ('moisturize', 'VB').

Q: But can't I get moisturize -> moist with SpaCy?

Let's go back to the start, where we defined what a lemma is. In short:

>>> import spacy
>>> nlp = spacy.load('en_core_web_sm')
>>> sent
'The laughs you two heard were triggered by memories of his own high j-flying moist moisture moisturize moisturizing.'
>>> doc = nlp(sent)
>>> [word.lemma_ for word in doc]
['the', 'laugh', '-PRON-', 'two', 'hear', 'be', 'trigger', 'by', 'memory', 'of', '-PRON-', 'own', 'high', 'j', '-', 'fly', 'moist', 'moisture', 'moisturize', 'moisturizing', '.']

See also How does spacy lemmatizer works? (again)

Q: Okay, fine. I can't get moisturize -> moist... And POS tag is not perfect for heard -> hear. But why can't I get j-flying -> fly?

Back to the question of why you need to convert j-flying -> fly: there are counterexamples of why you wouldn't want to split something that looks like a compound.

For example:

  • Should Classical-sounding go to sound?
  • Should X-fitting go to fit?
  • Should crash-landing go to landing?

Depending on your application's ultimate purpose, converting a token to your desired form may or may not be necessary.
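If you do decide that, for your application, hyphenated compounds should reduce to their final element, a naive pre-processing step is easy to sketch. This is purely illustrative, and the counterexamples above show why it is not always the right thing to do:

```python
def compound_head(token):
    """Naively reduce a hyphenated compound to its final element,
    e.g. 'j-flying' -> 'flying'; other tokens pass through unchanged.

    Caveat: this also turns 'crash-landing' into 'landing', which may
    or may not be what you want.
    """
    return token.rsplit('-', 1)[-1] if '-' in token else token

# The extracted head can then be lemmatized as usual, e.g.
# wnl.lemmatize(compound_head('j-flying'), pos='v')  # -> 'fly'
```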

Q: Then what is a good way to extract meaningful words?

I sound like a broken record, but it depends on your ultimate goal.

If your goal is really to understand the meaning of words, then you have to ask yourself the question, "What is the meaning of meaning?"

Does an individual word have a meaning out of context? Or does it carry the sum of meanings from all the possible contexts it could occur in?

Au courant, the state of the art basically treats all meanings as arrays of floats, and comparisons between those arrays are what give meaning its meaning. But is that really meaning, or just a means to an end? (Pun intended.)

Q: Why am I getting more questions than answers?

Welcome to the world of computational linguistics, which has its roots in philosophy (like computer science). Natural language processing is commonly known as the application of computational linguistics.


Food for thought

Q: Is a lemmatizer better than a stemmer?

A: No definite answer. (c.f. Stemmers vs Lemmatizers)




Answer 2:


Lemmatization is not an easy task, so you should not expect perfect results. You can, however, see whether you like the results of other lemmatization libraries better.

spaCy is an obvious Python option to evaluate. Stanford CoreNLP is another (JVM-based and GPL-licensed).

There are other options; none will be perfect.



Source: https://stackoverflow.com/questions/52393591/nltk-lemmatizer-extract-meaningful-words
