Question
I have the following code, where I transform text into a tf-idf representation:
...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'], dataset['classes'], test_size=test_percentil)
# Term-document matrix
count_vect = CountVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary)
x_train_counts = count_vect.fit_transform(x_train)
x_test_counts = count_vect.transform(x_test)
# tf-idf weighting
tf_transformer = TfidfTransformer(use_idf=True).fit(x_train_counts)
lista = tf_transformer.get_params()
x_train_tf = tf_transformer.transform(x_train_counts)
x_test_tf = tf_transformer.transform(x_test_counts)
...
Then I train a model and save it using pickle. The problem comes when, in another program, I try to predict new data. Basically, I have:
count_vect = CountVectorizer(ngram_range=(1, 1), min_df=1, max_features=None)
x_counts = count_vect.fit_transform(dataset['documents'])
# tf-idf weighting
tf_transformer = TfidfTransformer(use_idf=True).fit(x_counts)
x_tf = tf_transformer.transform(x_counts)
model.predict(x_tf)
When I execute this code, the output is
ValueError: X has 8933 features per sample; expecting 7488
I know this is a problem with the tf-idf representation, and I have heard that I need to reuse the same tf_transformer and vectorizer to get the expected input shape, but I don't know how to achieve this. I can store the other transformers and vectorizers, but I have tried different combinations and got nowhere.
Answer 1:
import pandas as pd
import joblib  # formerly "from sklearn.externals import joblib", which is deprecated
from sklearn.feature_extraction.text import TfidfVectorizer

# Fit the vectorizer on one set of texts
a = pd.Series(["hello, this is me", "hello this is me too"])
b = pd.Series(["hello, this is John", "hi it's Doe"])
tfidf = TfidfVectorizer().fit(a)

# Persist the fitted vectorizer and reload it elsewhere
joblib.dump(tfidf, 'path_to/tfidf.pkl')
tfidf = joblib.load('path_to/tfidf.pkl')

# transform (not fit_transform) reuses the vocabulary learned from a
tfidf.transform(b).todense()
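The point here is that only transform is called on b, so the resulting matrix has one column per term in the vocabulary learned from a; that is exactly why the feature count matches what a model trained on a's representation expects.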
回答2:
In the other program you are instantiating new objects and fitting them on the new data, so they know nothing about the vocabulary (and therefore the number of feature columns) learned during training.
You need to save the CountVectorizer and TfidfTransformer the same way you saved the model, and load them the same way in the other program, as sketched below.
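A minimal sketch of that idea, using pickle as in the question; the file names and the model variable are placeholders, not something from the original code:

import pickle

# --- training program: persist the fitted objects together with the model ---
with open('vectorizer.pkl', 'wb') as f:
    pickle.dump(count_vect, f)       # fitted CountVectorizer
with open('transformer.pkl', 'wb') as f:
    pickle.dump(tf_transformer, f)   # fitted TfidfTransformer
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)            # trained classifier

# --- prediction program: load them and only call transform, never fit ---
with open('vectorizer.pkl', 'rb') as f:
    count_vect = pickle.load(f)
with open('transformer.pkl', 'rb') as f:
    tf_transformer = pickle.load(f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

x_counts = count_vect.transform(dataset['documents'])   # same vocabulary as training
x_tf = tf_transformer.transform(x_counts)
predictions = model.predict(x_tf)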
Also, you can just use TfidfVectorizer instead of CountVectorizer + TfidfTransformer: it combines the two steps and makes saving and loading easier.
So during training do this:
...
x_train, x_test, y_train, y_test = model_selection.train_test_split(dataset['documents'], dataset['classes'], test_size=test_percentil)
# tf-idf matrix (CountVectorizer + TfidfTransformer in one step)
tf_vect = TfidfVectorizer(ngram_range=(1, Ngram), min_df=1, max_features=MaxVocabulary, use_idf=True)
x_train_tf = tf_vect.fit_transform(x_train)
x_test_tf = tf_vect.transform(x_test)
...
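Then, in the prediction program, load the fitted vectorizer and only call transform on the new documents. A minimal sketch, again using pickle; the 'tf_vect.pkl' and 'model.pkl' file names are placeholders:

import pickle

# training program: persist the fitted TfidfVectorizer next to the trained model
with open('tf_vect.pkl', 'wb') as f:
    pickle.dump(tf_vect, f)
with open('model.pkl', 'wb') as f:
    pickle.dump(model, f)

# prediction program: load both and only transform the new documents
with open('tf_vect.pkl', 'rb') as f:
    tf_vect = pickle.load(f)
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

x_new_tf = tf_vect.transform(dataset['documents'])  # same feature space as training
predictions = model.predict(x_new_tf)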
Source: https://stackoverflow.com/questions/51208115/error-predicting-x-has-n-features-per-sample-expecting-m