nltk stemmer: string index out of range

轮回少年 2021-02-05 13:11

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming i

2 answers
  • 2021-02-05 13:34

    This is an NLTK bug specific to NLTK version 3.2.2, for which I am to blame. It was introduced by PR https://github.com/nltk/nltk/pull/1261 which rewrote the Porter stemmer.

    I wrote a fix which went out in NLTK 3.2.3. If you're on version 3.2.2 and want the fix, just upgrade - e.g. by running

    pip install -U nltk
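
    To confirm whether an installation is the affected release before upgrading, a minimal sketch along these lines may help. The helper name `is_affected` is hypothetical and not part of NLTK; it only compares the version string against 3.2.2, the release named above.

```python
def is_affected(version: str) -> bool:
    """Return True if `version` is exactly 3.2.2, the release named above as buggy."""
    # Hypothetical helper: compares the first three dot-separated components.
    return version.split(".")[:3] == ["3", "2", "2"]


if __name__ == "__main__":
    try:
        import nltk  # only needed for the live check
    except ImportError:
        print("nltk is not installed")
    else:
        print(nltk.__version__, "affected:", is_affected(nltk.__version__))
```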
    
  • 2021-02-05 13:52

    I debugged the nltk.stem.porter module using pdb. After a few iterations, in _apply_rule_list() you get:

    >>> rule
    (u'at', u'ate', None)
    >>> word
    u'o'
    

    At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
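
    The failure can be reproduced without NLTK. The sketch below is not the library's code; `ends_double_unguarded` is a hypothetical stand-in that performs only the comparison described above, with no length check.

```python
def ends_double_unguarded(word):
    # Hypothetical stand-in: the same comparison, with no length guard.
    return word[-1] == word[-2]


try:
    ends_double_unguarded("o")  # a one-character word has no index -2
except IndexError as exc:
    message = str(exc)
    print(message)
```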

    If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

    def _doublec(self, word):
        """doublec(word) is TRUE <=> word ends with a double consonant"""
        if len(word) < 2:
            return False
        if (word[-1] != word[-2]):      
            return False        
        return self._cons(word, len(word)-1)
    

    As far as I can see, the len(word) < 2 check is missing in the new version.

    Changing _ends_double_consonant() to something like this should work:

    def _ends_double_consonant(self, word):
        """Implements condition *d from the paper

        Returns True if word ends with a double consonant
        """
        if len(word) < 2:
            return False
        return (
            word[-1] == word[-2] and
            self._is_consonant(word, len(word)-1)
        )
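
    To try the guarded check outside the stemmer class, here is a self-contained sketch. Both function names are hypothetical, and `is_consonant` is a simplified stand-in written from the Porter definition (a "y" counts as a consonant only at the start of a word or after a vowel); it is not NLTK's own `_is_consonant`.

```python
def is_consonant(word, i):
    """Simplified stand-in for the stemmer's consonant test (an assumption)."""
    if word[i] in "aeiou":
        return False
    if word[i] == "y":
        # 'y' is a consonant at position 0 or when the previous letter is a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(word):
    """Return True if word ends with a double consonant (condition *d)."""
    if len(word) < 2:
        # Guard: words shorter than two characters cannot end in a double letter.
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


for sample in ["o", "", "hopp", "see", "hop"]:
    print(repr(sample), ends_double_consonant(sample))
```

    With the guard in place, the one-character word from the pdb session returns False instead of raising.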
    

    I just proposed this change in the related NLTK issue.
