Testing text classification ML model with new data fails

旧时模样 submitted on 2020-12-23 18:06:03

Question


I have built a machine learning model to classify emails as spam or not spam. Now I want to test it on my own email and see the result, so I wrote the following code to classify the new email:

message = """Subject: Hello this is from google security team we want to recover your password. Please contact us 
as soon as possible"""

message = pd.Series([message,])
transformed_message = CountVectorizer(analyzer=process_text).fit_transform(message)
proba = model.predict_proba(transformed_message)[0]

Here, process_text is a function that preprocesses the email. When I run the code, I get the following error:

Number of features of the model must match the input. Model n_features is 37229 and input n_features is 13 

What is the problem, and how can I fix it?


Answer 1:


For every data preprocessing step in such a pipeline, we never fit again on new data, which is what you do here with your newly defined count vectorizer.

So, instead of calling fit_transform on a new count vectorizer, reuse the existing one (i.e. the one fitted on your training data) and apply its transform method. Your new data will then be mapped onto the same 37229 features the model was trained with, instead of the mere 13 features you get when you fit a fresh count vectorizer to such a short text.
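
The advice above can be sketched as follows. This is only a minimal illustration, not the asker's actual setup: the tiny training set, the labels, the MultinomialNB classifier, and the body of process_text are all assumptions made up for the example (the original post does not show them). The point is that the vectorizer is fitted once on the training data and the same fitted object is then used to transform the new email.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def process_text(text):
    # Hypothetical stand-in for the asker's process_text function:
    # lowercase the email and split it on whitespace.
    return text.lower().split()


# Made-up training data, for illustration only.
train_messages = [
    "Subject: win a free prize now",
    "Subject: meeting agenda for monday",
    "Subject: recover your password urgently free offer",
    "Subject: lunch tomorrow with the team",
]
train_labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# Fit the vectorizer ONCE, on the training data, and keep the fitted object.
vectorizer = CountVectorizer(analyzer=process_text)
X_train = vectorizer.fit_transform(train_messages)

# Assumed classifier; any estimator with predict_proba would do here.
model = MultinomialNB()
model.fit(X_train, train_labels)

# New email: call transform (not fit_transform) on the SAME fitted vectorizer.
message = [
    "Subject: Hello this is from google security team we want to "
    "recover your password. Please contact us as soon as possible"
]
transformed_message = vectorizer.transform(message)

# Both matrices now share the vocabulary learned at training time.
n_train_features = X_train.shape[1]
n_new_features = transformed_message.shape[1]

proba = model.predict_proba(transformed_message)[0]
print(n_train_features, n_new_features, proba)
```

Words in the new email that never appeared in the training data are simply ignored by transform. If prediction happens in a different script or session than training, the fitted vectorizer has to be saved and loaded together with the model (for example with joblib), or both can be wrapped in a single sklearn Pipeline so they cannot get out of sync.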



Source: https://stackoverflow.com/questions/62521521/testing-text-classification-ml-model-with-new-data-fails
