nltk stemmer: string index out of range

前端未结


I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming i


2 answers

旧巷少年郎

2021-02-05 13:34
This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running
```
pip install -U nltk
```
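To see whether an environment is on the affected release before upgrading, a check along these lines may help. It is only a sketch: the helper names `is_affected` and `installed_nltk_version` are made up for illustration, and it assumes, as stated above, that 3.2.2 is the only affected version.

```python
from importlib import metadata

# Per the answer above, the bug is specific to NLTK 3.2.2 (assumption: no other release is affected).
AFFECTED_VERSION = "3.2.2"


def is_affected(version):
    """Return True if the given version string is the release with the stemmer bug."""
    return version == AFFECTED_VERSION


def installed_nltk_version():
    """Return the installed nltk version string, or None if nltk is not installed."""
    try:
        return metadata.version("nltk")
    except metadata.PackageNotFoundError:
        return None


version = installed_nltk_version()
if version is None:
    print("nltk is not installed")
elif is_affected(version):
    print("nltk", version, "has the Porter stemmer bug; run: pip install -U nltk")
else:
    print("nltk", version, "is not the affected release")
```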

爱一瞬间的悲伤

2021-02-05 13:52

I debugged nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

```
>>> rule
(u'at', u'ate', None)
>>> word
u'o'
```

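At this point the _ends_double_consonant() method evaluates word[-1] == word[-2]. Since word is only one character long, word[-2] does not exist, so the comparison raises IndexError: string index out of range.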
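The failure can be reproduced without NLTK at all. The snippet below is a minimal illustration using plain Python strings; the variable names are only for demonstration.

```python
word = "o"  # the one-character stem seen in the pdb session

try:
    # word[-1] is fine, but word[-2] does not exist when len(word) == 1
    ends_double = word[-1] == word[-2]
    error_message = None
except IndexError as exc:
    ends_double = None
    error_message = str(exc)

print(error_message)
```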

If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

```
def _doublec(self, word):
    """doublec(word) is TRUE <=> word ends with a double consonant"""
    if len(word) < 2:
        return False
    if (word[-1] != word[-2]):
        return False
    return self._cons(word, len(word)-1)
```

As far as I can see, the len(word) < 2 check is missing in the new version.

Changing _ends_double_consonant() to something like this should work:

```
def _ends_double_consonant(self, word):
    """Implements condition *d from the paper

    Returns True if word ends with a double consonant
    """
    if len(word) < 2:
        return False
    return (
        word[-1] == word[-2] and
        self._is_consonant(word, len(word)-1)
    )
```

I just proposed this change in the related NLTK issue.
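To try the guarded check outside NLTK, here is a standalone sketch. The free functions `is_consonant` and `ends_double_consonant` are stand-ins written for this example, not NLTK's actual methods; the consonant test follows the usual Porter definition (any letter other than a, e, i, o, u, and other than y preceded by a consonant).

```python
def is_consonant(word, i):
    """Porter-style consonant test for word[i] (illustrative stand-in, not NLTK's code)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        # 'y' counts as a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(word):
    """Return True if word ends with a doubled consonant; safe for short words."""
    if len(word) < 2:
        # Guard: words of length 0 or 1 cannot end in a double consonant
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


for sample in ["o", "", "fall", "see", "hopp"]:
    print(repr(sample), ends_double_consonant(sample))
```

With the length guard in place, one-character words such as 'o' simply return False instead of raising.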
