Question
(This question is about string checking in general and not Natural Language Processing per se, but if you view it as an NLP problem, imagine it's a language that current analyzers cannot analyze. For simplicity's sake, I'll use English strings as examples.)
Let's say there are only 6 possible forms in which a word can be realized:
- the initial letter being capitalized
- its plural form with an "s"
- its plural form with an "es"
- capitalized + "es"
- capitalized + "s"
- the basic form without plural or capitalization
Let's say I want to find the index of the first instance where any form of the word coach
occurs in a sentence. Is there a simpler way of doing this than these 2 methods?
long if condition
sentence = "this is a sentence with the Coaches"
target = "coach"
print(target.capitalize())
for j, i in enumerate(sentence.split(" ")):
    if i == target.capitalize() or i == target.capitalize()+"es" or \
       i == target.capitalize()+"s" or i == target+"es" or i == target+"s" or \
       i == target:
        print(j)
iterating try-except
variations = [target, target+"es", target+"s", target.capitalize()+"es",
              target.capitalize()+"s", target.capitalize()]
for i in variations:
    try:
        j = sentence.split(" ").index(i)
        print(j)
    except ValueError:
        continue
Answer 1:
I recommend having a look at the stem package of NLTK: http://nltk.org/api/nltk.stem.html
Using it you can "remove morphological affixes from words, leaving only the word stem. Stemming algorithms aim to remove those affixes required for eg. grammatical role, tense, derivational morphology leaving only the stem of the word."
If your language is not currently covered by NLTK, you should consider extending NLTK. If you really need something simple and don't want to bother with NLTK, you should still write your code as a collection of small, easy-to-combine utility functions, for example:
import string

def variation(stem, word):
    # case-insensitive check against the stem and its two plural forms
    return word.lower() in [stem, stem + 'es', stem + 's']

def variations(sentence, stem):
    sentence = cleanPunctuation(sentence).split()
    return ( (i, w) for i, w in enumerate(sentence) if variation(stem, w) )

def cleanPunctuation(sentence):
    exclude = set(string.punctuation)
    return ''.join(ch for ch in sentence if ch not in exclude)

def firstVariation(sentence, stem):
    # returns the first (index, word) pair, or None if there is no match
    for i, w in variations(sentence, stem):
        return i, w

sentence = "First coach, here another two coaches. Coaches are nice."
print(firstVariation(sentence, 'coach'))
# print all variations/forms of 'coach' found in the sentence:
print("\n".join([str(i) + ' ' + w for i, w in variations(sentence, 'coach')]))
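If you want exactly the six forms listed in the question (only the initial letter may be capitalized, no punctuation stripping), a set lookup is probably the shortest option. This is only a minimal sketch: the helper names word_forms and first_form_index are made up for illustration, and it assumes words are separated by whitespace.

```python
def word_forms(stem):
    """Return the six surface forms of stem listed in the question."""
    bases = (stem, stem.capitalize())
    return {base + suffix for base in bases for suffix in ("", "s", "es")}


def first_form_index(sentence, stem):
    """Return the index of the first word that is a form of stem, or None."""
    forms = word_forms(stem)
    words = sentence.split()
    # next() stops at the first hit; the default None covers "no match"
    return next((i for i, w in enumerate(words) if w in forms), None)


sentence = "this is a sentence with the Coaches"
print(first_form_index(sentence, "coach"))  # 6
```

Unlike variation() above, this is case-strict, so "COACHES" would not match. Which behaviour is right depends on your data.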
Answer 2:
Morphology is typically a finite-state phenomenon, so regular expressions are the perfect tool to handle it. Build an RE that matches all of the cases with a function like:
import re

def inflect(stem):
    """Returns an RE that matches all inflected forms of stem."""
    # the trailing "?" makes the plural suffix optional, so the bare stem matches too
    pat = "^[%s%s]%s(?:e?s)?$" % (stem[0], stem[0].upper(), re.escape(stem[1:]))
    return re.compile(pat)
Usage:
>>> sentence = "this is a sentence with the Coaches"
>>> target = inflect("coach")
>>> [(i, w) for i, w in enumerate(sentence.split()) if re.match(target, w)]
[(6, 'Coaches')]
If the inflection rules get more complicated than this, consider using Python's verbose REs.
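For illustration, here is roughly what a verbose variant of inflect could look like. Treat it as a sketch rather than a definitive version: the name inflect_verbose is made up, and it assumes the same six forms as in the question (optional initial capital, optional "s" or "es"). With re.VERBOSE, whitespace in the pattern is ignored and everything after # is a comment, so each rule can sit on its own line.

```python
import re


def inflect_verbose(stem):
    """Return a compiled verbose RE matching the six forms of stem."""
    pattern = r"""
        ^
        [%s%s]        # first letter, lower or upper case
        %s            # rest of the stem, escaped
        (?: e? s )?   # optional plural suffix: 's' or 'es'
        $
    """ % (
        re.escape(stem[0].lower()),
        re.escape(stem[0].upper()),
        re.escape(stem[1:]),
    )
    return re.compile(pattern, re.VERBOSE)


target = inflect_verbose("coach")
sentence = "this is a sentence with the Coaches"
print([(i, w) for i, w in enumerate(sentence.split()) if target.match(w)])
# [(6, 'Coaches')]
```

New rules (say, another suffix) then become one extra commented line in the pattern instead of a longer one-liner.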
Source: https://stackoverflow.com/questions/13237533/find-different-realization-of-a-word-in-a-sentence-string-python