Keep TF-IDF result for predicting new content using scikit-learn for Python


I am using sklearn in Python to do some clustering. I've trained on 200,000 samples, and the code below works well.

corpus = open("token_from_xml.txt")
vectorizer =


                      
5 answers

滥情空心, 2020-12-07 21:35
                                                                       
Instead of using a CountVectorizer to store the vocabulary, the vocabulary of the TfidfVectorizer can be used directly.

Training phase:

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# corpus: an iterable of training documents (strings)
# tf-idf based vectors
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True, max_features=500000)

# Fit the model
tf_transformer = tf.fit(corpus)

# Dump the fitted vectorizer to a file
with open("tfidf1.pkl", "wb") as f:
    pickle.dump(tf_transformer, f)


# Testing phase
with open("tfidf1.pkl", "rb") as f:
    tf1 = pickle.load(f)

# Create new TfidfVectorizer with old vocabulary
tf1_new = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), stop_words="english", lowercase=True,
                          max_features=500000, vocabulary=tf1.vocabulary_)
X_tf1 = tf1_new.fit_transform(new_corpus)


fit_transform works here because the vocabulary is fixed to the one learned during training, so the new documents are mapped onto the same feature columns as the training data. Be aware that fit_transform re-estimates the IDF weights from new_corpus. Since the pickled object is the fitted vectorizer itself, it also carries the training IDF weights (idf_), so calling tf1.transform(new_corpus) directly reuses both the vocabulary and the training weights, exactly as a plain transform on the test data would if nothing had been stored.
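
To make the difference between reusing the unpickled vectorizer and refitting a new one on the old vocabulary concrete, here is a small sketch. The toy documents and all variable names are made up for illustration, and the pickle round trip happens in memory rather than through a file.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, for illustration only.
train_docs = ["red apple pie", "green apple tart", "red berry jam"]
new_docs = ["green berry pie", "green green tart"]

fitted = TfidfVectorizer().fit(train_docs)

# Round-trip through pickle, as a model saved to disk would be.
restored = pickle.loads(pickle.dumps(fitted))

# Option A: reuse the restored vectorizer as-is
# (training vocabulary and training IDF weights).
X_reuse = restored.transform(new_docs)

# Option B: new vectorizer with the old vocabulary, refit on the new documents
# (same columns, but IDF weights re-estimated from new_docs).
refit = TfidfVectorizer(vocabulary=restored.vocabulary_)
X_refit = refit.fit_transform(new_docs)

same_columns = X_reuse.shape == X_refit.shape
same_idf_as_training = bool((restored.idf_ == fitted.idf_).all())
refit_idf_changed = not bool((refit.idf_ == fitted.idf_).all())
```

Both options give matrices with the same number of columns; only option A keeps the weights learned from the training data, which is usually what you want when the result feeds a model trained on those features.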
                                                                        
                                                        
            
            
              
                
遥遥无期, 2020-12-07 21:46
                                                                       
I successfully saved the feature list by saving vectorizer.vocabulary_, and reused it with CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_).

Code below:

import pickle
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = np.array(["aaa bbb ccc", "aaa bbb ffffd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(corpus)
# Save vectorizer.vocabulary_
with open("feature.pkl", "wb") as f:
    pickle.dump(vectorizer.vocabulary_, f)

# Load it later
transformer = TfidfTransformer()
with open("feature.pkl", "rb") as f:
    loaded_vec = CountVectorizer(decode_error="replace", vocabulary=pickle.load(f))
tfidf = transformer.fit_transform(loaded_vec.fit_transform(np.array(["aaa ccc eee"])))


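That works: tfidf has the same number of features as the training data. Note that transformer is fit on the new documents here, so only the feature columns are carried over from training, not the IDF weights.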
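
As a quick check of what a fixed vocabulary does to unseen words, here is a minimal sketch using the same toy strings. The names feature_names and counts are introduced only for this illustration, and the vocabulary is passed in memory instead of being loaded from feature.pkl.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

train = np.array(["aaa bbb ccc", "aaa bbb ffffd"])
vectorizer = CountVectorizer(decode_error="replace")
vec_train = vectorizer.fit_transform(train)

# Rebuild a vectorizer from the saved vocabulary only.
loaded_vec = CountVectorizer(decode_error="replace", vocabulary=vectorizer.vocabulary_)
vec_new = loaded_vec.transform(np.array(["aaa ccc eee"]))

# Map each column back to its term, ordered by column index.
feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
counts = dict(zip(feature_names, vec_new.toarray()[0].tolist()))
```

The word "eee" is not in the training vocabulary, so it is silently dropped, and the new row has the same four columns as vec_train.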
                                                                        
                                                        
            
            
              
                
执笔经年, 2020-12-07 21:49
                                                                       
You can do the vectorization and the tf-idf transformation in one stage:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()


Then fit and transform on the training data:

tfidf = vec.fit_transform(training_data)


And use the fitted tf-idf model to transform the unseen data:

unseen_tfidf = vec.transform(unseen_data)
km = KMeans(30)
kmresult = km.fit(tfidf).predict(unseen_tfidf)
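
For a version that runs end to end, here is a small sketch of the same flow. The example sentences are invented, and n_clusters=2, n_init=10 and random_state=0 are arbitrary choices made so the toy data can be clustered reproducibly.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data, for illustration only.
training_data = [
    "cats purr and sleep",
    "dogs bark and run",
    "cats sleep all day",
    "dogs run all day",
]
unseen_data = ["cats purr", "dogs bark loudly"]

vec = TfidfVectorizer()
tfidf = vec.fit_transform(training_data)   # learns vocabulary and IDF weights
unseen_tfidf = vec.transform(unseen_data)  # reuses them, no refitting

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)
labels = km.predict(unseen_tfidf)
```

Because vec is only fit once, unseen_tfidf has the same columns as tfidf, which is what lets predict work on it.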

                                                                        
                                                        
            
            
              
                
盖世英雄少女心, 2020-12-07 21:54
                                                                       
If you want to store the resulting feature matrix for future use, you can do this:

import pickle

# vectorizer and transformer are assumed to be a CountVectorizer and a TfidfTransformer
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

# store the content
with open("x_result.pkl", "wb") as handle:
    pickle.dump(tfidf, handle)
# load the content
with open("x_result.pkl", "rb") as handle:
    tfidf = pickle.load(handle)
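
A self-contained sketch of that round trip might look like the following. The corpus is made up, and a temporary directory is used here (an assumption of this example) so nothing is left on disk.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = ["aaa bbb ccc", "aaa bbb ffffd"]  # hypothetical documents

vectorizer = CountVectorizer()
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "x_result.pkl")
    # store the sparse tf-idf matrix
    with open(path, "wb") as handle:
        pickle.dump(tfidf, handle)
    # load it back
    with open(path, "rb") as handle:
        restored = pickle.load(handle)

# No entry differs between the original and the restored matrix.
unchanged = (restored != tfidf).nnz == 0
```

This only preserves the already computed matrix. To vectorize new text later, the fitted vectorizer and transformer have to be saved as well.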

                                                                        
                                                        
            
            
              
                
南笙, 2020-12-07 21:54
                                                                       
A simpler solution is to use the joblib library, as the documentation suggests:

import joblib  # formerly sklearn.externals.joblib, which recent scikit-learn releases no longer provide
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# texts: an iterable of training documents (strings)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
feature_name = vectorizer.get_feature_names_out()  # get_feature_names() in older releases
tfidf = TfidfTransformer()
tfidf.fit(X)

# save your model to disk
joblib.dump(tfidf, 'tfidf.pkl')

# load your model
tfidf = joblib.load('tfidf.pkl')
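
Since the TfidfTransformer alone cannot turn raw text into features, one option is to save the vectorizer together with it. The sketch below does that by wrapping both in a scikit-learn pipeline. The pipeline, the file name tfidf_pipeline.pkl and the temporary directory are choices of this example, not part of the code above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

texts = ["aaa bbb ccc", "aaa bbb ffffd"]  # hypothetical training documents

# Counting and tf-idf weighting as one object that can be saved in one file.
model = make_pipeline(CountVectorizer(), TfidfTransformer())
X_train = model.fit_transform(texts)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tfidf_pipeline.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

# The loaded pipeline maps new text onto the training features.
X_new = loaded.transform(["aaa ccc eee"])
```

The loaded object keeps both the vocabulary and the IDF weights, so new documents come out with the same columns and weighting as the training matrix.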

                                                                        
                                                        
            
            
              
                