add stemming support to CountVectorizer (sklearn)

北荒 2021-01-31 18:51

I'm trying to add stemming to my NLP pipeline with sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

stop = stopwords.words('french')
stemmer = French         


        
3 answers
  •  囚心锁ツ
    2021-01-31 19:27

    I know I am a little late in posting my answer, but here it is in case someone still needs help.

    The cleanest way to add a language-specific stemmer to CountVectorizer is to subclass it and override build_analyzer(), so that every token produced by the default analyzer is stemmed:

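    from sklearn.feature_extraction.text import CountVectorizer
    from nltk.corpus import stopwords
    import nltk.stem

    french_stemmer = nltk.stem.SnowballStemmer('french')

    class StemmedCountVectorizer(CountVectorizer):
        def build_analyzer(self):
            # Reuse the default analyzer (preprocessing, tokenizing, stop-word removal),
            # then stem each token it yields.
            analyzer = super().build_analyzer()
            return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

    # sklearn only ships a built-in 'english' stop list, so pass NLTK's French list explicitly
    # (requires nltk.download('stopwords') once).
    vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))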
    

    Because StemmedCountVectorizer inherits from CountVectorizer, you can call fit, transform, and fit_transform on vectorizer_s as usual.
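
As a rough illustration of the subclassing approach in the answer above, here is a minimal end-to-end sketch. The corpus, the three-word stop list, and the names `docs`, `toy_stop_words`, `counts`, and `chat_total` are made up for this example and are not from the original post; it assumes nltk and scikit-learn are installed. The Snowball stemmer itself needs no NLTK data download.

```python
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

french_stemmer = SnowballStemmer("french")


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Default analyzer handles lowercasing, tokenizing and stop-word removal;
        # each token it yields is then stemmed.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]


# Hypothetical toy corpus and stop list, for illustration only.
docs = ["les chats dorment", "le chat dort", "un chien dort"]
toy_stop_words = ["le", "les", "un"]

vectorizer = StemmedCountVectorizer(analyzer="word", stop_words=toy_stop_words)
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# "chats" and "chat" share a stem, so both are counted in the same column.
chat_column = vectorizer.vocabulary_[french_stemmer.stem("chat")]
chat_total = int(counts[:, chat_column].sum())

print(sorted(vectorizer.vocabulary_))  # the stemmed vocabulary
print(chat_total)
```

Note that stop words are removed before stemming here, so the stop list should contain the unstemmed surface forms.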
