Getting the closest noun from a stemmed word

Question

Short version:
If I have a stemmed word, say 'comput' for 'computing' or 'sugari' for 'sugary',
is there a way to construct its closest noun form?
That is, 'computer' or 'sugar' respectively.

Longer version:
I'm using Python with NLTK and WordNet to perform a few semantic similarity tasks on a bunch of words.
I noticed that most semantic similarity scores work well only for nouns, while adjectives and verbs don't give any results.
Understanding the inaccuracies involved, I wanted to convert a word from its verb/adjective form to its noun form, so that I can get an estimate of the similarity (instead of the 'None' that normally gets returned for adjectives).

I thought one way to do this would be to use a stemmer to get at the root word, and then try to construct the closest noun form of that root.
George-Bogdan Ivanov's algorithm from here works pretty well. I wanted to try alternative approaches. Is there any better way to convert a word from adjective/verb form to noun form?

Answer 1:

You might want to look at this example:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> WordNetLemmatizer().lemmatize('having','v')
'have'

(from this SO answer) to see if it sends you in the right direction.

Answer 2:

First, extract all the possible candidates from the WordNet noun synsets. Then use difflib to rank those strings by similarity to your target stem.

>>> from nltk.corpus import wordnet as wn
>>> from itertools import chain
>>> from difflib import get_close_matches as gcm
>>> target = "comput"
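>>> # In NLTK 3+, lemma_names() is a method; keep all lemmas of any noun synset with a lemma containing the stem.
>>> candidates = set(chain(*[ss.lemma_names() for ss in wn.all_synsets('n') if any(target in i for i in ss.lemma_names())]))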
>>> gcm(target,candidates)[0]

A more human-readable way to compute the candidates:

candidates = set()
for ss in wn.all_synsets('n'):
  names = ss.lemma_names()  # all lemma names for this synset (a method in NLTK 3+)
  if any(target in name for name in names):
    # keep every lemma of a matching synset, as the one-liner above does
    candidates.update(names)
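
To see how the difflib ranking step behaves on its own, here is a minimal sketch that uses a small hand-written noun list in place of the WordNet lemmas, so it runs without NLTK. The helper name `closest_noun`, its `require_substring` flag, and the sample list are assumptions made up for illustration; they are not part of NLTK or of the answer above.

```python
from difflib import get_close_matches

def closest_noun(stem, nouns, require_substring=True, cutoff=0.6):
    """Return the noun most similar to `stem`, or None if nothing qualifies."""
    # Optionally keep only nouns that contain the stem (the substring filter).
    if require_substring:
        nouns = [n for n in nouns if stem in n]
    # get_close_matches returns matches sorted by similarity, best first.
    matches = get_close_matches(stem, nouns, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical stand-in for the noun lemmas pulled from WordNet.
sample_nouns = ["computer", "computation", "computing",
                "sugar", "sugariness", "sugarcane"]

print(closest_noun("comput", sample_nouns))
print(closest_noun("sugari", sample_nouns))
print(closest_noun("sugari", sample_nouns, require_substring=False))
print(closest_noun("xyz", sample_nouns))
```

On this sample list, 'comput' resolves to 'computer'. With the substring filter, 'sugari' can only reach 'sugariness', because 'sugar' does not contain the stem; dropping the filter lets difflib pick 'sugar'. Returning None when nothing matches also avoids the IndexError that indexing an empty result with [0] would raise.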

Source: https://stackoverflow.com/questions/17083442/getting-the-closest-noun-from-a-stemmed-word

Tags

python

nltk

wordnet

stemming

pos-tagger