Question
I'm working on a University of Washington assignment in which I have to predict the scores for sample_test_matrix (see the last few lines of the code) using decision_function() in LogisticRegression. But this is the error I get:
ValueError: X has 145 features per sample; expecting 113092
Here is the code:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
products = pd.read_csv('amazon_baby.csv')
def remove_punct (text) :
    import string
    text = str(text)
    for i in string.punctuation:
        text = text.replace(i,"")
    return(text)
products['review_clean'] = products['review'].apply(remove_punct)
products = products[products.rating != 3]
products['sentiment'] = products['rating'].apply(lambda x : +1 if x > 3 else -1 )
train_data_index = pd.read_json('module-2-assignment-train-idx.json')
test_data_index = pd.read_json('module-2-assignment-test-idx.json')
train_data = products.loc[train_data_index[0], :]
test_data = products.loc[test_data_index[0], :]
train_data = train_data.dropna()
test_data = test_data.dropna()
from sklearn.feature_extraction.text import CountVectorizer
train_matrix = vectorizer.fit_transform(train_data['review_clean'])
test_matrix = vectorizer.fit_transform(test_data['review_clean'])
sentiment_model = LogisticRegression()
sentiment_model.fit(train_matrix, train_data['sentiment'])
print (sentiment_model.coef_)
sample_data = test_data[10:13]
print (sample_data)
sample_test_matrix = vectorizer.transform(sample_data['review_clean'])
scores = sentiment_model.decision_function(sample_test_matrix)
print (scores)
Here is the products data:
   Name                                               Review                                             Rating
0  Planetwise Flannel Wipes                           These flannel wipes are OK, but in my opinion ...  3
1  Planetwise Wipe Pouch                              it came early and was not disappointed. i love...  5
2  Annas Dream Full Quilt with 2 Shams                Very soft and comfortable and warmer than it l...  5
3  Stop Pacifier Sucking without tears with Thumb...  This is a product well worth the purchase. I ...   5
4  Stop Pacifier Sucking without tears with Thumb...  All of my kids have cried non-stop when I trie...  5
Answer 1:
This line is causing errors in the subsequent lines:
test_matrix = vectorizer.fit_transform(test_data['review_clean'])
Change the above to this:
test_matrix = vectorizer.transform(test_data['review_clean'])
Explanation: Using fit_transform() refits the CountVectorizer on the test data, so everything it learned from the training data is discarded and the vocabulary is rebuilt from the test data alone.
You then use that same vectorizer object to transform sample_data['review_clean'], so the resulting matrix contains only the features learned from test_data (145 of them, going by the error message).
But sentiment_model was trained on the vocabulary from train_data (113092 features). The two feature sets differ, which is why decision_function() raises the ValueError.
Always use transform() on test data, never fit_transform().
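The difference between the two calls can be reproduced on a tiny example. This is only a minimal sketch: the review strings and labels below are made-up toy data (not taken from amazon_baby.csv), the names refit_vectorizer, wrong_matrix and mismatch_raised are hypothetical, and CountVectorizer is used with its default settings, which may differ from what the assignment uses.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data, invented for illustration only.
train_reviews = [
    "love this product works great",
    "great quality love it",
    "terrible product broke quickly",
    "awful quality waste of money",
]
train_labels = [1, 1, -1, -1]
test_reviews = ["love the quality", "broke quickly waste"]

# Fit the vocabulary on the training reviews only.
vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_reviews)

model = LogisticRegression()
model.fit(train_matrix, train_labels)

# Correct: transform() reuses the training vocabulary, so the test
# matrix has the same number of columns the model was trained on.
test_matrix = vectorizer.transform(test_reviews)
scores = model.decision_function(test_matrix)
print("train features:", train_matrix.shape[1])
print("test features: ", test_matrix.shape[1])
print("scores:", scores)

# Wrong: fit_transform() on the test reviews learns a new, smaller
# vocabulary. A separate vectorizer is used here so the fitted one
# above stays intact.
refit_vectorizer = CountVectorizer()
wrong_matrix = refit_vectorizer.fit_transform(test_reviews)
print("refit features:", wrong_matrix.shape[1])

mismatch_raised = False
try:
    model.decision_function(wrong_matrix)
except ValueError as err:
    # Same kind of feature-count error as in the question.
    mismatch_raised = True
    print("ValueError:", err)
```

Words that appear only in the test reviews (here "the") are simply ignored by transform(), because they have no column in the training vocabulary.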
Source: https://stackoverflow.com/questions/47204919/unable-to-evaluate-score-using-decision-function-in-logistic-regression