I'm trying to add stemming to my NLP pipeline with sklearn.
from nltk.stem.snowball import FrenchStemmer
stop = stopwords.words('french')
stemmer = French
I know I am a little late in posting my answer, but here it is in case someone still needs help.
The following is the cleanest approach I know for adding a language stemmer to CountVectorizer: override build_analyzer().
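from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords  # needs nltk.download('stopwords') once
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the standard analyzer, then stem every token it yields.
        analyzer = super(StemmedCountVectorizer, self).build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# sklearn's only built-in stop list is 'english', so pass NLTK's French list explicitly.
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word", stop_words=stopwords.words('french'))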
You can freely call the fit and transform functions of the CountVectorizer class on your vectorizer_s object.
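As a rough illustration of calling fit and transform on such a subclass, here is a minimal sketch. The two-document corpus is made up for the example, min_df is lowered to 1 because the corpus is so small, and stop words are left out so that no NLTK corpus download is needed.

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

french_stemmer = SnowballStemmer("french")


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Standard preprocessing and tokenization first, stemming afterwards.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]


# Hypothetical toy corpus; min_df=1 because it is too small for min_df=3.
train_docs = ["Tu marches dans la rue", "la rue"]
vectorizer_s = StemmedCountVectorizer(min_df=1)

# fit() learns the stemmed vocabulary from the training documents.
vectorizer_s.fit(train_docs)
vocabulary = sorted(vectorizer_s.vocabulary_)
print(vocabulary)

# transform() maps new text onto the learned vocabulary (one column per stem).
counts = vectorizer_s.transform(["Tu marches"]).toarray()
print(counts)
```

The columns of the count matrix follow the alphabetical order of the stems in the vocabulary.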
You can pass a callable as analyzer to the CountVectorizer constructor to provide a custom analyzer. This appears to work for me.
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem.snowball import FrenchStemmer
stemmer = FrenchStemmer()
analyzer = CountVectorizer().build_analyzer()
def stemmed_words(doc):
    return (stemmer.stem(w) for w in analyzer(doc))
stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
print(stem_vectorizer.fit_transform(['Tu marches dans la rue']))
print(stem_vectorizer.get_feature_names())  # removed in scikit-learn 1.2; use get_feature_names_out() there
Prints out:
(0, 4) 1
(0, 2) 1
(0, 0) 1
(0, 1) 1
(0, 3) 1
[u'dan', u'la', u'march', u'ru', u'tu']
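One thing to keep in mind with the callable approach: when analyzer is a callable, options such as stop_words or ngram_range on the outer CountVectorizer are not applied, so they have to be set on the vectorizer that builds the base analyzer. A minimal sketch of that, assuming a tiny hand-written stop list (not a complete French list) and the name base_analyzer chosen just for this example:

```python
from nltk.stem.snowball import FrenchStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = FrenchStemmer()

# Illustrative stop list (an assumption, not a complete French list).
# It is configured here because the outer vectorizer ignores stop_words
# once a callable analyzer is supplied.
base_analyzer = CountVectorizer(stop_words=["dans", "la"]).build_analyzer()


def stemmed_words(doc):
    # Stop words are dropped by base_analyzer before stemming happens.
    return (stemmer.stem(w) for w in base_analyzer(doc))


stem_vectorizer = CountVectorizer(analyzer=stemmed_words)
stem_vectorizer.fit(["Tu marches dans la rue"])

features = sorted(stem_vectorizer.vocabulary_)
print(features)
```

Reading the vocabulary_ attribute instead of calling get_feature_names() keeps the sketch independent of the scikit-learn version.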
You can try:
def build_analyzer(self):
    # Python 3 form; avoids naming the class in the super() call.
    analyzer = super().build_analyzer()
    return lambda doc: (stemmer.stem(w) for w in analyzer(doc))
and remove the __init__ method.
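To show why dropping __init__ is harmless, here is a small sketch under a few assumptions: the class name StemmedCountVectorizer is hypothetical (the question's own class is not shown in full), and the one-sentence corpus is only for illustration. Without its own __init__, the subclass inherits CountVectorizer's constructor, so keyword arguments, get_params() and clone() keep working.

```python
from nltk.stem.snowball import FrenchStemmer
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer

stemmer = FrenchStemmer()


class StemmedCountVectorizer(CountVectorizer):
    # No __init__ here: the constructor is inherited from CountVectorizer.
    def build_analyzer(self):
        analyzer = super().build_analyzer()
        return lambda doc: (stemmer.stem(w) for w in analyzer(doc))


# Constructor arguments pass straight through to CountVectorizer.
vectorizer = StemmedCountVectorizer(min_df=1, lowercase=True)
params = vectorizer.get_params()

# clone() rebuilds the estimator from its parameters, which relies on the
# inherited constructor signature.
copy = clone(vectorizer)
matrix = copy.fit_transform(["Tu marches dans la rue"])
print(type(copy).__name__, matrix.shape)
```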