问题
I'm lemmatizing the Ted Dataset Transcript. There's something strange I notice: Not all words are being lemmatized. To say,
selected -> select
Which is right.
However, involved !-> involve
and horsing !-> horse
unless I explicitly input the 'v' (Verb) attribute.
On the python terminal, I get the right output but not in my code:
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet
>>> lem = WordNetLemmatizer()
>>> lem.lemmatize('involved','v')
u'involve'
>>> lem.lemmatize('horsing','v')
u'horse'
The relevant section of the code is this:
for l in LDA_Row[0].split('+'):
w=str(l.split('*')[1])
word=lmtzr.lemmatize(w)
wordv=lmtzr.lemmatize(w,'v')
print wordv, word
# if word is not wordv:
# print word, wordv
The whole code is here.
What is the problem?
回答1:
The lemmatizer requires the correct POS tag to be accurate, if you use the default settings of the WordNetLemmatizer.lemmatize()
, the default tag is noun, see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
To resolve the problem, always POS-tag your data before lemmatizing, e.g.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
... wntag = tag[0].lower()
... wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
... if not wntag:
... lemma = word
... else:
... lemma = wnl.lemmatize(word, wntag)
... print lemma
...
This
be
a
foo
bar
sentence
Note that 'is -> be', i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
u'be'
To answer the question with words from your examples:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
... wntag = tag[0].lower()
... wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
... lemma = wnl.lemmatize(word, wntag) if wntag else word
... print lemma
...
These
sentence
involve
some
horse
around
Note that there are some quirks with WordNetLemmatizer:
- wordnet lemmatization and pos tagging in python
- Python NLTK Lemmatization of the word 'further' with wordnet
Also NLTK's default POS tagger is under-going some major changes to improve accuracy:
- Python NLTK pos_tag not returning the correct part-of-speech tag
- https://github.com/nltk/nltk/issues/1110
- https://github.com/nltk/nltk/pull/1143
And for an out-of-the-box / off-the-shelf solution to lemmatizer, you can take a look at https://github.com/alvations/pywsd and how I've made some try-excepts to catch words that are not in WordNet, see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
来源:https://stackoverflow.com/questions/32957895/wordnetlemmatizer-not-returning-the-right-lemma-unless-pos-is-explicit-python