Error predicting: X has n features per sample, expecting m

社会主义新天地 提交于 2020-07-20 04:12:10

问题


I got the following code, where I transform a text to tf:

...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'],dataset['classes'],test_size=test_percentil)
#Term document matrix
count_vect = CountVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary)
x_train_counts = count_vect.fit_transform(x_train)
x_test_counts=count_vect.transform(x_test)
#Term Inverse-Frequency
tf_transformer = TfidfTransformer(use_idf=True).fit(x_train_counts)
lista=tf_transformer.get_params()
x_train_tf = tf_transformer.transform(x_train_counts)
x_test_tf=tf_transformer.transform(x_test_counts)
...

Then, I train a model and save it using pickle. The problem comes when, in another program, I try to predict new data. Basically, I got:

count_vect = CountVectorizer(ngram_range=(1, 1), min_df=1, max_features=None)
x_counts = count_vect.fit_transform(dataset['documents'])

#Term Inverse-Frequency
tf_transformer = TfidfTransformer(use_idf=True).fit(x_counts)
x_tf = tf_transformer.transform(x_train_counts)

model.predict(x_tf)

When I execute this code, the output is

ValueError: X has 8933 features per sample; expecting 7488

I know this is a problem with the TfIdf representation, and I hear that I need to use the same tf_transformer and vectorizer to get the expected input shape, but I don't know how to achieve this. I can store the others transformers and vectorizers, but I have tried using different combinations and I got nothing.


回答1:


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.externals import joblib
a = pd.Series(["hello, this is me", "hello this is me too"])
b = pd.Series(["hello, this is John", "hi it's Doe"])
tfidf = TfidfVectorizer().fit(a)
joblib.dump(tfidf, 'path_to/tfidf.pkl')
tfidf = joblib.load('path_to/tfidf.pkl')
tfidf.transform(b).todense()



回答2:


In another program you are instantiating a new object, which will not know that previous data has those many columns.

You need to save the CountVectorizer and TfidfTransformer the same way as you saved the model and load them the same way in another program.

Also, you can just use the TfidfVectorizer instead of CountVectorizer + TfidfTransformer, because it does the combined thing and will make your work (saving and loading them easier).

So during training do this:

...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'],dataset['classes'],test_size=test_percentil)
#Term document matrix
tf_vect = TfidfVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary, use_idf=True)
x_train_tf = tf_vect.fit_transform(x_train)
x_test_tf = tf_vect.transform(x_test)

...


来源:https://stackoverflow.com/questions/51208115/error-predicting-x-has-n-features-per-sample-expecting-m

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!