nltk stemmer: string index out of range

轮回少年 2021-02-05 13:11

I have a set of pickled text documents which I would like to stem using nltk's PorterStemmer. For reasons specific to my project, I would like to do the stemming i

2 Answers
  •  爱一瞬间的悲伤
    2021-02-05 13:52

    I debugged the nltk.stem.porter module using pdb. After a few iterations, inside _apply_rule_list() you get:

    >>> rule
    (u'at', u'ate', None)
    >>> word
    u'o'
    

    At this point the _ends_double_consonant() method tries to do word[-1] == word[-2] and it fails.
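The failure can be reproduced without NLTK at all. The snippet below is only a minimal sketch: `ends_double_consonant_unguarded` is a hypothetical stand-in that keeps just the indexing comparison described above, not the library's actual method.

```python
def ends_double_consonant_unguarded(word):
    # Hypothetical stand-in: only the comparison that fails is kept.
    # For a one-character string, word[-2] does not exist.
    return word[-1] == word[-2]


try:
    ends_double_consonant_unguarded("o")
    error_message = None
except IndexError as exc:
    # Python reports the same message as in the question title.
    error_message = str(exc)

print(error_message)
```

Any word that has been reduced to a single character by earlier rules will hit the same error when it reaches this comparison.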

    If I'm not mistaken, in NLTK 3.2 the corresponding method was the following:

    def _doublec(self, word):
        """doublec(word) is TRUE <=> word ends with a double consonant"""
        if len(word) < 2:
            return False
        if (word[-1] != word[-2]):      
            return False        
        return self._cons(word, len(word)-1)
    

    As far as I can see, the len(word) < 2 check is missing in the new version.

    Changing _ends_double_consonant() to something like this should work:

    def _ends_double_consonant(self, word):
        """Implements condition *d from the paper

        Returns True if word ends with a double consonant
        """
        if len(word) < 2:
            return False
        return (
            word[-1] == word[-2] and
            self._is_consonant(word, len(word)-1)
        )
    

    I just proposed this change in the related NLTK issue.
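To check the guarded logic outside of NLTK, here is a standalone sketch. The free functions `is_consonant` and `ends_double_consonant` are illustrative names, not NLTK's API. `is_consonant` is my own approximation of the Porter definition (a, e, i, o, u are vowels, and "y" is a consonant only at the start of a word or after a vowel), so treat it as an assumption rather than the library's code.

```python
VOWELS = frozenset("aeiou")


def is_consonant(word, i):
    # Illustrative helper (assumption): approximates the Porter definition.
    if word[i] in VOWELS:
        return False
    if word[i] == "y":
        # "y" is a consonant at the start of a word or after a vowel.
        return i == 0 or not is_consonant(word, i - 1)
    return True


def ends_double_consonant(word):
    # Guard first: words shorter than two characters cannot end
    # with a double consonant, and indexing word[-2] would fail.
    if len(word) < 2:
        return False
    return word[-1] == word[-2] and is_consonant(word, len(word) - 1)


results = {w: ends_double_consonant(w) for w in ["", "o", "hopp", "hope", "fee"]}
print(results)
```

With the length check in place, the one-character word from the pdb session simply returns False instead of raising.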
