Unable to evaluate score using decision_function() in Logistic Regression

Posted by ⅰ亾dé卋堺 on 2019-12-14 02:36:17

Question


I'm working on a University of Washington assignment in which I have to predict the scores for sample_test_matrix (see the last few lines of the code) using decision_function() in LogisticRegression. However, I get this error:

    ValueError: X has 145 features per sample; expecting 113092

Here is the code:

   import pandas as pd 
   import numpy as np 
   from sklearn.linear_model import LogisticRegression

   products = pd.read_csv('amazon_baby.csv')

   def remove_punct (text) :
       import string 
       text = str(text)
       for i in string.punctuation:
          text = text.replace(i,"")
       return(text)

   products['review_clean'] = products['review'].apply(remove_punct)
   products = products[products.rating != 3]
   products['sentiment'] = products['rating'].apply(lambda x : +1 if x > 3 else  -1 )

   train_data_index = pd.read_json('module-2-assignment-train-idx.json')
   test_data_index = pd.read_json('module-2-assignment-test-idx.json')

   train_data = products.loc[train_data_index[0], :]
   test_data = products.loc[test_data_index[0], :]
   train_data = train_data.dropna()
   test_data = test_data.dropna()

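   from sklearn.feature_extraction.text import CountVectorizer

   # Assumed: the post never creates the vectorizer, so default settings are a guess.
   vectorizer = CountVectorizer()
   train_matrix = vectorizer.fit_transform(train_data['review_clean'])
   test_matrix = vectorizer.fit_transform(test_data['review_clean'])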

   sentiment_model = LogisticRegression()
   sentiment_model.fit(train_matrix, train_data['sentiment'])
   print (sentiment_model.coef_)

   sample_data = test_data[10:13]
   print (sample_data)

   sample_test_matrix = vectorizer.transform(sample_data['review_clean'])
   scores = sentiment_model.decision_function(sample_test_matrix)
   print (scores)

Here is the products data:

        Name                                               Review                                             Rating
    0   Planetwise Flannel Wipes                           These flannel wipes are OK, but in my opinion ...  3
    1   Planetwise Wipe Pouch                              it came early and was not disappointed. i love...  5
    2   Annas Dream Full Quilt with 2 Shams                Very soft and comfortable and warmer than it l...  5
    3   Stop Pacifier Sucking without tears with Thumb...  This is a product well worth the purchase.  I ...  5
    4   Stop Pacifier Sucking without tears with Thumb...  All of my kids have cried non-stop when I trie...  5

Answer 1:


This line causes the error in the lines that follow it:

    test_matrix = vectorizer.fit_transform(test_data['review_clean'])

Change it to:

    test_matrix = vectorizer.transform(test_data['review_clean'])

Explanation: fit_transform() refits the CountVectorizer on the test data, so the vocabulary learned from the training data is discarded and a new one is built from the test data alone.

You then use that refitted vectorizer to transform sample_data['review_clean'], so sample_test_matrix contains only the features learned from test_data.

But sentiment_model was trained on the vocabulary from train_data, so the number and meaning of the feature columns no longer match what the model expects. That mismatch is what raises the ValueError.
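
The mismatch can be reproduced on a tiny scale. The following is only a sketch: the documents and variable names are made up for illustration and are not the assignment's data. It fits a CountVectorizer on a few training sentences, then calls fit_transform() again on different sentences, which replaces the vocabulary and changes the number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpora standing in for train_data and test_data.
train_docs = ["great soft blanket", "love this blanket", "terrible noisy toy"]
test_docs = ["soft toy", "broke quickly"]

vectorizer = CountVectorizer()

# Learn the vocabulary from the training documents.
train_matrix = vectorizer.fit_transform(train_docs)
train_vocab = set(vectorizer.vocabulary_)

# Calling fit_transform() again throws that vocabulary away
# and learns a new one from the test documents only.
refit_matrix = vectorizer.fit_transform(test_docs)
test_vocab = set(vectorizer.vocabulary_)

print("columns after fitting on train:", train_matrix.shape[1])
print("columns after refitting on test:", refit_matrix.shape[1])
print("words only in the refit vocabulary:", sorted(test_vocab - train_vocab))
```

A model fitted on train_matrix expects the first column count, so any matrix produced by the refitted vectorizer has the wrong width for it.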

Always call transform() on test data, never fit_transform().
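
As a rough illustration of that rule, here is a minimal end-to-end sketch. The sentences and labels are invented for the example (they are assumptions, not the amazon_baby.csv data): the vectorizer is fitted once on the training text, reused with transform() for the test text, and decision_function() then accepts the result.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data; +1 = positive review, -1 = negative review.
train_docs = [
    "love this soft blanket",
    "great product works well",
    "terrible quality broke quickly",
    "bad noisy toy do not buy",
]
train_labels = [1, 1, -1, -1]
test_docs = ["soft and great", "broke quickly", "love it"]

vectorizer = CountVectorizer()

# fit_transform() only on the training text: this defines the feature columns.
train_matrix = vectorizer.fit_transform(train_docs)

# transform() on the test text: reuses the same columns, ignores unseen words.
test_matrix = vectorizer.transform(test_docs)

model = LogisticRegression()
model.fit(train_matrix, train_labels)

# Works because test_matrix has the same number of columns as train_matrix.
scores = model.decision_function(test_matrix)
print(train_matrix.shape[1], test_matrix.shape[1])
print(scores)
```

Words that appear only in the test text (such as "and" or "it" here) are simply dropped by transform(), which is the intended behavior: the model has no coefficient for them.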



Source: https://stackoverflow.com/questions/47204919/unable-to-evaluate-score-using-decision-function-in-logistic-regression
