add stemming support to CountVectorizer (sklearn)

北荒 2021-01-31 18:51

I'm trying to add stemming to my NLP pipeline with sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

stop = stopwords.words('french')
stemmer = French         


        
3 Answers
  • 2021-01-31 19:27

    I know I am a little late in posting my answer, but here it is in case someone still needs help.

    The following is a clean way to add a language stemmer to CountVectorizer: subclass it and override build_analyzer().

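    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import stopwords
    import nltk.stem

    french_stemmer = nltk.stem.SnowballStemmer('french')

    class StemmedCountVectorizer(CountVectorizer):
        def build_analyzer(self):
            # Reuse the default analyzer (preprocessing, tokenizing, stop-word
            # removal), then stem every token it produces.
            analyzer = super().build_analyzer()
            return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

    # sklearn only accepts 'english' as a built-in stop_words string, so pass an
    # explicit French list instead (needs nltk.download('stopwords') once).
    vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))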
    

    You can then call the usual fit and transform methods of CountVectorizer on your vectorizer_s object.
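
    To make the fit/transform step concrete, here is a minimal sketch rather than a definitive recipe. The two-document corpus is made up for illustration, and min_df is left at its default of 1 so that such a tiny corpus is not pruned to an empty vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer("french")


class StemmedCountVectorizer(CountVectorizer):
    """CountVectorizer whose analyzer stems every token it produces."""

    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]


# Hypothetical corpus, chosen only for illustration.
docs = ["Tu marches dans la rue", "La rue"]

# Default min_df=1: with min_df=3 every term of this tiny corpus would be dropped.
vectorizer = StemmedCountVectorizer()
counts = vectorizer.fit_transform(docs)

# vocabulary_ maps each stemmed term to its column; sorting gives column order.
vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(counts.toarray())
```

    Each row of the matrix is one document and each column is one stemmed term, so the second row only has counts in the columns for the stems of "la" and "rue".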

  • 2021-01-31 19:31

    You can pass a callable as analyzer to the CountVectorizer constructor to provide a custom analyzer. This appears to work for me.

    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.stem.snowball import FrenchStemmer
    
    stemmer = FrenchStemmer()
    analyzer = CountVectorizer().build_analyzer()
    
    def stemmed_words(doc):
        return (stemmer.stem(w) for w in analyzer(doc))
    
    stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
    print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
    print(stem_vectorizer.get_feature_names())
    

    Prints out:

      (0, 4)    1
      (0, 2)    1
      (0, 0)    1
      (0, 1)    1
      (0, 3)    1
    [u'dan', u'la', u'march', u'ru', u'tu']
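
    One thing to watch with a callable analyzer: as far as I know, CountVectorizer ignores its own stop_words argument in that case, because that option only applies to the built-in "word" analyzer. A possible workaround, sketched below, is to set the stop words on the inner vectorizer that supplies the base analyzer. The two-word stop list here is a made-up placeholder, not a real French stop list.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer

stemmer = FrenchStemmer()

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
french_stop_words = ["la", "dans"]

# Stop-word removal is configured on the inner analyzer, so it runs before stemming.
base_analyzer = CountVectorizer(stop_words=french_stop_words).build_analyzer()


def stemmed_words(doc):
    """Tokenize and drop stop words with the base analyzer, then stem what is left."""
    return [stemmer.stem(w) for w in base_analyzer(doc)]


stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
stem_vectorizer.fit(["Tu marches dans la rue"])

terms = sorted(stem_vectorizer.vocabulary_)
print(terms)
```

    Since the stop words are filtered before stemming, the list should contain unstemmed word forms.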
    
  • 2021-01-31 19:41

    You can try:

    def build_analyzer(self):
        # Inside your CountVectorizer subclass: wrap the inherited analyzer
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(w) for w in analyzer(doc))
    

    and remove the __init__ method.
