Question
I have a set of pickled text documents that I would like to stem using NLTK's PorterStemmer. For reasons specific to my project, I would like to do the stemming inside a Django app view.
However, when stemming the documents inside the Django view, PorterStemmer().stem() raises an IndexError: string index out of range exception for the string 'oed'. As a result, running the following:
# xkcd_project/search/views.py
from django.shortcuts import render
from nltk.stem.porter import PorterStemmer

def get_results(request):
    s = PorterStemmer()
    s.stem('oed')  # raises IndexError under NLTK 3.2.2
    return render(request, 'list.html')
raises the mentioned error:
Traceback (most recent call last):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/exception.py", line 39, in inner
    response = get_response(request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 187, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/django/core/handlers/base.py", line 185, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/Users/jkarimi91/Projects/xkcd_search/xkcd_project/search/views.py", line 15, in get_results
    s.stem('oed')
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 665, in stem
    stem = self._step1b(stem)
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 376, in _step1b
    lambda stem: (self._measure(stem) == 1 and
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 258, in _apply_rule_list
    if suffix == '*d' and self._ends_double_consonant(word):
  File "//anaconda/envs/xkcd/lib/python2.7/site-packages/nltk/stem/porter.py", line 214, in _ends_double_consonant
    word[-1] == word[-2] and
IndexError: string index out of range
What is really odd is that running the same stemmer on the same string outside Django (whether in a separate Python file or an interactive Python console) produces no error. In other words:
# test.py
from nltk.stem.porter import PorterStemmer
s = PorterStemmer()
print s.stem('oed')
followed by:
python test.py
# successfully prints 'o'
What is causing this issue?
Answer 1:
This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.
I wrote a fix, which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade, e.g. by running
pip install -U nltk
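To confirm whether a given environment is on the affected release before upgrading, a quick check along these lines may help. This is only a sketch: the helper name is_affected is hypothetical, and it flags nothing but the single version (3.2.2) that this answer names as buggy.

```python
from __future__ import print_function


def is_affected(version):
    """Return True only for the release this answer names as buggy (3.2.2)."""
    # Hypothetical helper for illustration; it does not inspect NLTK itself.
    return version.strip() == "3.2.2"


try:
    import nltk
    print(nltk.__version__, "affected:", is_affected(nltk.__version__))
except ImportError:
    print("nltk is not installed in this environment")
```

Running it with the same interpreter that serves the Django app and again with the one used for the standalone script shows which NLTK version each of them actually imports.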
Answer 2:
I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:
>>> rule
(u'at', u'ate', None)
>>> word
u'o'
At this point the _ends_double_consonant() method tries to evaluate word[-1] == word[-2], which fails because word is only one character long, so word[-2] is out of range.
If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:
def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)
As far as I can see, the len(word) < 2 check is missing in the new version.
Changing _ends_double_consonant() to something like this should work:
def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
I just proposed this change in the related NLTK issue.
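The effect of the length guard can be reproduced without NLTK. The following is a minimal standalone sketch, not NLTK's actual code: is_consonant is a simplified stand-in for the stemmer's consonant test, and both function names are made up for illustration.

```python
def is_consonant(word, i):
    """Simplified Porter consonant test (stand-in, not NLTK's implementation)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        # 'y' counts as a consonant at the start or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant_unguarded(word):
    # Mirrors the failing expression: word[-2] does not exist for 1-char words.
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


def ends_double_consonant(word):
    # Guarded version: words shorter than two characters cannot end in a double.
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


try:
    ends_double_consonant_unguarded("o")
except IndexError as exc:
    print("unguarded:", exc)

print("guarded:", ends_double_consonant("o"))
```

With the one-character stem 'o' (what remains of 'oed' once the suffix is stripped), the unguarded version raises the same IndexError as in the traceback, while the guarded version simply returns False.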
Source: https://stackoverflow.com/questions/41517595/nltk-stemmer-string-index-out-of-range