Python NLTK Lemmatization of the word 'further' with wordnet

问题

I'm working on a lemmatizer using python, NLTK and the WordNetLemmatizer. Here is a random text that output what I was expecting

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
lem = WordNetLemmatizer()
lem.lemmatize('worse', pos=wordnet.ADJ) // here, we are specifying that 'worse' is an adjective

Output: 'bad'

lem.lemmatize('worse', pos=wordnet.ADV) // here, we are specifying that 'worse' is an adverb

Output: 'worse'

Well, everything here is fine. The behaviour is the same with other adjectives like 'better' (for an irregular form) or 'older' (note that the same test with 'elder' will never output 'old', but I guess that wordnet is not an exhaustive list of all the existing english word)

My question comes when trying with the word 'furter':

lem.lemmatize('further', pos=wordnet.ADJ) // as an adjective

Output: 'further'

lem.lemmatize('further', pos=wordnet.ADV) // as an adverb

Output: 'far'

This is the exact opposite behaviour of the one for the 'worse' word!

Can anybody explain me why ? Is it a bug coming from the wordnet synsets data or does it come from my misunderstanding of the english grammar ?

Please excuse me if the question is already answered, I've search on google and SO, but when specifying the keyword "further", I can find anything related but mess because of the popularity of this word...

Thank you in advance, Romain G.

回答1:

WordNetLemmatizer uses the ._morphy function to access its a word's lemma; from http://www.nltk.org/_modules/nltk/stem/wordnet.html and returns the possible lemmas with the minimum length.

def lemmatize(self, word, pos=NOUN):
    lemmas = wordnet._morphy(word, pos)
    return min(lemmas, key=len) if lemmas else word

And the ._morphy function apply rules iteratively to get a lemma; the rules keep reducing the length of the word and substituting the affixes with the MORPHOLOGICAL_SUBSTITUTIONS. then it sees whether there are other words that are shorter but the same as the reduced word:

def _morphy(self, form, pos):
    # from jordanbg:
    # Given an original string x
    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    # 2. Return all that are in the database
    # 3. If there are no matches, keep applying rules until you either
    #    find a match or you can't go any further

    exceptions = self._exception_map[pos]
    substitutions = self.MORPHOLOGICAL_SUBSTITUTIONS[pos]

    def apply_rules(forms):
        return [form[:-len(old)] + new
                for form in forms
                for old, new in substitutions
                if form.endswith(old)]

    def filter_forms(forms):
        result = []
        seen = set()
        for form in forms:
            if form in self._lemma_pos_offset_map:
                if pos in self._lemma_pos_offset_map[form]:
                    if form not in seen:
                        result.append(form)
                        seen.add(form)
        return result

    # 0. Check the exception lists
    if form in exceptions:
        return filter_forms([form] + exceptions[form])

    # 1. Apply rules once to the input to get y1, y2, y3, etc.
    forms = apply_rules([form])

    # 2. Return all that are in the database (and check the original too)
    results = filter_forms([form] + forms)
    if results:
        return results

    # 3. If there are no matches, keep applying rules until we find a match
    while forms:
        forms = apply_rules(forms)
        results = filter_forms(forms)
        if results:
            return results

    # Return an empty list if we can't find anything
    return []

However if the word is in the list of exceptions, it will return a fixed value kept in the exceptions, see _load_exception_map in http://www.nltk.org/_modules/nltk/corpus/reader/wordnet.html:

def _load_exception_map(self):
    # load the exception file data into memory
    for pos, suffix in self._FILEMAP.items():
        self._exception_map[pos] = {}
        for line in self.open('%s.exc' % suffix):
            terms = line.split()
            self._exception_map[pos][terms[0]] = terms[1:]
    self._exception_map[ADJ_SAT] = self._exception_map[ADJ]

Going back to your example, worse -> bad and further -> far CANNOT be achieved from the rules, thus it has to be from the exception list. Since it's an exception list, there are bound to be inconsistencies.

The exception list are kept in ~/nltk_data/corpora/wordnet/adv.exc and ~/nltk_data/corpora/wordnet/adv.exc.

From adv.exc:

best well
better well
deeper deeply
farther far
further far
harder hard
hardest hard

From adj.exc:

...
worldliest worldly
wormier wormy
wormiest wormy
worse bad
worst bad
worthier worthy
worthiest worthy
wrier wry
...

来源：https://stackoverflow.com/questions/22999273/python-nltk-lemmatization-of-the-word-further-with-wordnet

标签

python

nltk

wordnet

lemmatization