SMOTE oversampling and cross-validation

问题

I am working on a binary classification problem in Weka with a highly imbalanced data set (90% in one category and 10% in the other). I first applied SMOTE (http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume16/chawla02a-html/node6.html) to the entire data set to even out the categories and then performed 10-fold cross-validation over the newly obtained data. I found (overly?) optimistic results with F1 around 90%.

Is this due to oversampling? Is it bad practice to perform cross-validation on data on which SMOTE is applied? Are there any ways to solve this problem?

回答1:

I think you should split the data on test and training first, then perform SMOTE just on the training part, and then test the algorithm on the part of the dataset that doesn't have synthetic examples, that'll give you a better picture of the performance of the algorithm.

回答2:

According to my experience, dividing the data set by hand is not good way to deal with this problem. When you have 1 data set, you should have cross validation on each classifier you use in a way that 1 fold of your cross validation is your test set_which you should not implement SMOTE on it_ and you have 9 other folds as your training set in which you must have a balanced data set. Repeat this action in a loop for 10 times. Then you will have better result than dividing whole data set by hand.

It is obvious that if you apply SMOTE on both test and training set, you are having synthesized test set which gives you a high accuracy that is not actually correct.

来源：https://stackoverflow.com/questions/31856326/smote-oversampling-and-cross-validation

标签

machine-learning

weka

text-classification