I'm trying to add stemming to my NLP pipeline with sklearn.
from nltk.stem.snowball import FrenchStemmer
stop = stopwords.words('french')
stemmer = French
I know I am a little late in posting my answer, but here it is in case someone still needs help.
The following is the cleanest approach I know for adding a language stemmer to CountVectorizer: override build_analyzer().
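from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the default analyzer (preprocessing, tokenizing, stop-word removal), then stem each token.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# scikit-learn only accepts 'english' as a built-in stop-word name, so pass the NLTK French list explicitly.
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))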
You can freely call the fit and transform functions of the CountVectorizer class on your vectorizer_s object.
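For what it's worth, here is a minimal end-to-end sketch of how that might look in practice. The toy corpus and the short stop-word list are made up for illustration (they are not from the question), and min_df is lowered to 1 only because three tiny documents would leave no terms at min_df=3. Passing a plain list also avoids having to download the NLTK stopwords corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import SnowballStemmer

french_stemmer = SnowballStemmer("french")


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # The default analyzer lowercases, tokenizes and drops stop words;
        # stemming is applied to whatever tokens it returns.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]


# Hypothetical toy data, for illustration only.
docs = [
    "Les chats mangent la souris",
    "Le chat mange les souris",
    "Un chien mange",
]
# Hypothetical hand-written stop-word list (an assumption, not the NLTK list).
french_stop_words = ["le", "la", "les", "un", "une"]

vectorizer_s = StemmedCountVectorizer(
    min_df=1, analyzer="word", stop_words=french_stop_words
)

# fit_transform learns the stemmed vocabulary and returns the
# document-term count matrix (one row per document).
X = vectorizer_s.fit_transform(docs)

print(sorted(vectorizer_s.vocabulary_))  # stemmed terms kept as features
print(X.toarray())                       # counts per document and term
```

Note that stop words are removed before stemming here, so the stop-word list should contain the unstemmed forms.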