add stemming support to CountVectorizer (sklearn)


北荒 2021-01-31 18:51

I'm trying to add stemming to my NLP pipeline with sklearn.

from nltk.corpus import stopwords
from nltk.stem.snowball import FrenchStemmer

stop = stopwords.words('french')
stemmer = French


      
      
        
3 answers

囚心锁ツ (OP) 2021-01-31 19:27

                        
I know I am a little late in posting my answer,
but here it is in case someone still needs help.

The following is the cleanest approach to adding a language stemmer to CountVectorizer: override build_analyzer().

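from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords
import nltk.stem

french_stemmer = nltk.stem.SnowballStemmer('french')

class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Reuse the stock analyzer (preprocessing, tokenizing, stop-word removal),
        # then stem every token it yields.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]

# sklearn's only built-in stop list is 'english'; stop_words='french' raises a ValueError,
# so pass the NLTK French list instead (needs nltk.download('stopwords')).
vectorizer_s = StemmedCountVectorizer(min_df=3, analyzer="word",
                                      stop_words=stopwords.words('french'))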


You can then call the usual fit and transform methods of CountVectorizer on your vectorizer_s object.
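
As a quick sanity check, here is a minimal sketch of using such a vectorizer end to end. The toy corpus `docs` and the two-word stop list `toy_stop_words` are made up for illustration (they are not from the question), `min_df` is left at its default because the corpus is tiny, and the class is repeated so the snippet runs on its own.

```python
from nltk.stem import SnowballStemmer
from sklearn.feature_extraction.text import CountVectorizer

french_stemmer = SnowballStemmer("french")


class StemmedCountVectorizer(CountVectorizer):
    def build_analyzer(self):
        # Stock analyzer first (lowercasing, tokenizing, stop words), then stemming.
        analyzer = super().build_analyzer()
        return lambda doc: [french_stemmer.stem(w) for w in analyzer(doc)]


# Hypothetical toy data, for illustration only.
docs = [
    "Les chats mangent",
    "Le chat mange",
    "Les chiens dorment",
]
toy_stop_words = ["le", "les"]

vectorizer = StemmedCountVectorizer(analyzer="word", stop_words=toy_stop_words)
X = vectorizer.fit_transform(docs)  # sparse document-term matrix, one row per document

# Inflected forms such as "chats" and "chat" should share a single column.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Note that stop words are removed by the stock analyzer before stemming, so the stop list should contain unstemmed words.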
    
             
                                                        
            
            
              
                