Process for oversampling data for imbalanced binary classification

余生分开走 2020-12-20 03:43

My data is split roughly 30%/70% between class 0 (the minority class) and class 1 (the majority class). Since I do not have a lot of data, I am planning to oversample the minority class to ba

2 answers
  • 2020-12-20 03:59

    In my experience this is bad practice. As you mentioned, the test data should contain only unseen samples; otherwise the evaluation cannot reveal overfitting and will overstate how well training went. If you need more samples, consider data transformations instead. For example, in human/cat image classification the subjects are roughly symmetric, so you can double the sample size by mirroring the images.
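The mirroring idea can be sketched in a few lines. This is only an illustration under stated assumptions: images are represented as plain lists of pixel rows, the helper names `mirror_horizontally` and `augment_with_mirrors` are made up for this example, and the augmentation is applied to the training images only so that the test set stays untouched.

```python
def mirror_horizontally(image):
    """Return a left-right mirrored copy of a 2D image (a list of pixel rows)."""
    return [list(reversed(row)) for row in image]


def augment_with_mirrors(train_images, train_labels):
    """Append a mirrored copy of every training image, keeping its label."""
    images = list(train_images) + [mirror_horizontally(img) for img in train_images]
    labels = list(train_labels) + list(train_labels)
    return images, labels


# Toy 2x3 "images" (hypothetical data); real inputs would be pixel arrays.
train_images = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [0, 1, 2]]]
train_labels = ["cat", "human"]

aug_images, aug_labels = augment_with_mirrors(train_images, train_labels)
```

With array-based images, a library routine such as NumPy's `fliplr` would do the same job as the hypothetical `mirror_horizontally` helper.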

  • 2020-12-20 04:10

    > I was wondering if oversampling should be done before or after splitting my data into train and test sets.

    It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test sets; see also my related answer here.
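A minimal sketch of that order of operations, using only the Python standard library. The helper names `stratified_split` and `oversample_minority`, the 30/70 toy labels, and the 20% test fraction are assumptions made for illustration; in practice one would more likely reach for scikit-learn's `train_test_split` and a resampler such as imbalanced-learn's `RandomOverSampler`.

```python
import random


def stratified_split(labels, test_fraction, rng):
    """Split sample indices per class so both sets keep the original class ratio."""
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def oversample_minority(indices, labels, rng):
    """Randomly duplicate smaller-class indices until every class matches the largest."""
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for cls_indices in by_class.values():
        extra = [rng.choice(cls_indices) for _ in range(target - len(cls_indices))]
        balanced.extend(cls_indices + extra)
    return balanced


rng = random.Random(0)
# Toy labels: 30 samples of class 0 (minority) and 70 of class 1 (majority).
labels = [0] * 30 + [1] * 70

# Step 1: split first. Step 2: oversample the training indices only.
train_idx, test_idx = stratified_split(labels, test_fraction=0.2, rng=rng)
train_balanced = oversample_minority(train_idx, labels, rng)
```

Because the duplicates are drawn from `train_idx` alone, no index in `test_idx` can appear in `train_balanced`, and the test set keeps the original 30/70 class ratio.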

    > I have generally seen it done before splitting in online examples, like this

    From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is here: if it is the product of a train-test split, then the oversampling does take place after splitting, as it should.

    > However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.

    Exactly; this is the reason why the oversampling should be done after the train-test split, and not before.
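To make the leakage concrete, here is a small standard-library sketch of the wrong order: oversample first, then split. The sample ids, the 30/70 class sizes, and the 80/20 split are illustrative assumptions, not taken from the question's code.

```python
import random

rng = random.Random(42)
# Each sample is represented by a unique id (hypothetical toy data).
minority = [f"min_{i}" for i in range(30)]
majority = [f"maj_{i}" for i in range(70)]

# Wrong order, step 1: duplicate random minority samples until both classes have 70.
oversampled = minority + [rng.choice(minority) for _ in range(40)] + majority

# Wrong order, step 2: shuffle and split 80/20 afterwards.
rng.shuffle(oversampled)
cut = len(oversampled) * 8 // 10
train, test = oversampled[:cut], oversampled[cut:]

# Any test sample whose id also occurs in the training set is not unseen data.
train_ids = set(train)
leaked = [sample for sample in test if sample in train_ids]
print(f"{len(leaked)} of {len(test)} test samples also appear in the training set")
```

Only minority samples can leak here, since they are the only ones that were duplicated; those are exactly the samples whose test performance matters most in an imbalanced problem.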

    (I once witnessed a case where the modeller was struggling to understand why he was getting ~100% test accuracy, much higher than his training accuracy. It turned out that his initial dataset was full of duplicates (no class imbalance there, but the idea is similar), and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...)
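A quick check for this kind of problem is to count how many test rows already occur in the training set, and to drop exact duplicates before splitting. The sketch below assumes rows are hashable tuples of feature values plus a label; the helper names and the toy rows are made up for illustration.

```python
def drop_exact_duplicates(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def count_test_rows_seen_in_training(train_rows, test_rows):
    """Count test rows that are exact copies of some training row."""
    train_keys = {tuple(r) for r in train_rows}
    return sum(tuple(r) in train_keys for r in test_rows)


# Hypothetical rows of (feature_1, feature_2, label); two rows are repeated.
rows = [(5.1, 3.5, 0), (4.9, 3.0, 1), (5.1, 3.5, 0), (6.2, 2.9, 1), (4.9, 3.0, 1)]

# Splitting the raw rows: one of the two test rows is a copy of a training row.
naive_overlap = count_test_rows_seen_in_training(rows[:3], rows[3:])

# Dropping duplicates first leaves three distinct rows and no overlap after the split.
unique_rows = drop_exact_duplicates(rows)
clean_overlap = count_test_rows_seen_in_training(unique_rows[:2], unique_rows[2:])
```

A nonzero overlap on a real dataset is a hint that test accuracy may be inflated by rows the model has already seen.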

    > I am fine doing this

    You shouldn't :)
