Predicting text data labels in test data set with Weka?

拟墨画扇 提交于 2019-12-04 19:09:39
Sentry

For almost any machine learning algorithm, the training data and the test data need to have the same format. That means, both must have the same features, i.e. attributes in weka, in the same format, including the class.

The problem is probably that you pre-process the training set and the test set independently, and the StrintToWordVectorFilter will create different features for each set. Hence, the model, trained on the training set, is incompatible to the test set.

What you rather want to do is initialize the filter on the training set and then apply it on both training and test set.

The question Weka: ReplaceMissingValues for a test file deals with this issue, but I'll repeat the relevant part here:

Instances train = ...   // from somewhere
Instances test = ...    // from somewhere
Filter filter = new StringToWordVector(); // could be any filter
filter.setInputFormat(train);  // initializing the filter once with training set
Instances newTrain = Filter.useFilter(train, filter);  // configures the Filter based on train instances and returns filtered instances
Instances newTest = Filter.useFilter(test, filter);    // create new test set

Now, you can train the SVM and apply the resulting model on the test data.

If training and testing have to be in separate runs or programs, it should be possible to serialize the initialized filter together with the model. When you load (deserialize) the model, you can also load the filter and apply it on the test data. They should be compatibel now.

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!