Find different realization of a word in a sentence string - Python

Posted by 廉价感情 on 2019-12-24 11:23:33

Question


(This question is about string checking in general, not Natural Language Processing per se. If you do view it as an NLP problem, imagine the language is one that current analyzers cannot analyze; for simplicity's sake, I'll use English strings as examples.)

Let's say there are only 6 possible forms in which a word can be realized:

  1. the initial letter being capitalized
  2. its plural form with an "s"
  3. its plural form with an "es"
  4. capitalized + "es"
  5. capitalized + "s"
  6. the basic form without plural or capitalization

Let's say I want to find the index of the first occurrence of any form of the word "coach" in a sentence. Is there a simpler way of doing it than these 2 methods?

long if condition

sentence = "this is a sentence with the Coaches"
target = "coach"

print(target.capitalize())

for j, i in enumerate(sentence.split(" ")):
  if i == target.capitalize() or i == target.capitalize()+"es" or \
     i == target.capitalize()+"s" or i == target+"es" or i==target+"s" or \
     i == target:
    print(j)

iterating try-except

variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]

words = sentence.split(" ")
for i in variations:
  try:
    j = words.index(i)  # index of the first exact match, if any
    print(j)
  except ValueError:
    continue

Answer 1:


I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html

Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."
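To make the idea of stemming concrete, here is a minimal sketch that goes in the opposite direction from listing every form: reduce each word to a stem and compare stems. Note that `toy_stem` and `first_index` are hypothetical names made up for this illustration, and the suffix stripping is a deliberately crude stand-in, not NLTK's algorithm; it only knows the "s" and "es" endings from the question and will over-strip words such as "this".

```python
def toy_stem(word):
    """Crude illustrative stemmer (an assumption, not NLTK's algorithm):
    lower-case the word and strip one trailing "es" or "s"."""
    w = word.lower()
    for suffix in ("es", "s"):
        # Never strip the whole word away.
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[:-len(suffix)]
    return w


def first_index(sentence, target):
    """Return the index of the first word whose toy stem equals target,
    or None if no word matches."""
    for i, word in enumerate(sentence.split()):
        if toy_stem(word) == target:
            return i
    return None


print(first_index("this is a sentence with the Coaches", "coach"))
```

A real stemmer replaces `toy_stem` with rules suited to the language at hand; the comparison loop stays the same.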

If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example:

import string 

def variation(stem, word):
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    for i, w  in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."

print(firstVariation(sentence, 'coach'))

# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i,w in variations(sentence, 'coach')]))



Answer 2:


Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like:

import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # The trailing "?" makes the plural suffix optional, so the bare stem matches too.
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)

Usage:

>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
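One caveat with splitting on whitespace is that punctuation stays attached to the words, so "coaches." would not match an anchored pattern. A possible workaround, sketched below, is to swap the `^...$` anchors for `\b` word boundaries and search the whole sentence at once. The function name `find_first_form` and its return shape are assumptions made for this example; the pattern also treats the bare stem as a match.

```python
import re


def find_first_form(sentence, stem):
    """Return (word_index, char_offset, matched_text) for the first form
    of stem in sentence, or None. Hypothetical helper for illustration."""
    # \b word boundaries replace ^...$ so the pattern can be searched
    # anywhere in the sentence, even next to punctuation.
    pat = re.compile(r"\b[%s%s]%s(?:e?s)?\b" % (
        re.escape(stem[0].lower()),
        re.escape(stem[0].upper()),
        re.escape(stem[1:]),
    ))
    m = pat.search(sentence)
    if m is None:
        return None
    # The word index is the number of whitespace-separated words
    # that come before the match.
    word_index = len(sentence[:m.start()].split())
    return word_index, m.start(), m.group()


print(find_first_form("First coach, here another two coaches. Coaches are nice.", "coach"))
```

The character offset is handy if you need to slice or highlight the original string; the word index is comparable to what `enumerate(sentence.split())` gives.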

If the inflection rules get more complicated than this, consider using Python's verbose REs.
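As a rough illustration of what a verbose RE could look like here, the sketch below spells out the same kind of pattern with `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The name `inflect_verbose` is made up for this example, and the pattern assumes only the "s"/"es" suffixes from the question, with the bare stem also allowed.

```python
import re


def inflect_verbose(stem):
    """Hypothetical verbose-mode counterpart of inflect()."""
    pat = r"""
        ^
        [%s%s]       # first letter, lower- or upper-case
        %s           # rest of the stem, escaped
        (?: e?s )?   # optional plural suffix: 's' or 'es'
        $
    """ % (
        re.escape(stem[0].lower()),
        re.escape(stem[0].upper()),
        re.escape(stem[1:]),
    )
    return re.compile(pat, re.VERBOSE)


target = inflect_verbose("coach")
sentence = "this is a sentence with the Coaches"
print([(i, w) for i, w in enumerate(sentence.split()) if target.match(w)])
```

Each new inflection rule can then get its own commented line, which is easier to maintain than one dense pattern string.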



Source: https://stackoverflow.com/questions/13237533/find-different-realization-of-a-word-in-a-sentence-string-python
