Process for oversampling data for imbalanced binary classification

问题

I have about a 30% and 70% for class 0 (minority class) and class 1 (majority class). Since I do not have a lot of data, I am planning to oversample the minority class to balance out the classes to become a 50-50 split. I was wondering if oversampling should be done before or after splitting my data into train and test sets. I have generally seen it done before splitting in online examples, like this:

df_class0 = train[train.predict_var == 0]
df_class1 = train[train.predict_var == 1]
df_class1_over = df_class1.sample(len(df_class0), replace=True)
df_over = pd.concat([df_class0, df_class1_over], axis=0)

However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data. I am fine doing this, but I would like to know what is considered good practice. Thank you!

回答1:

I was wondering if oversampling should be done before or after splitting my data into train and test sets.

It should certainly be done after splitting, i.e. it should be applied only to your training set, and not to your validation and test ones; see also my related answer here.

I have generally seen it done before splitting in online examples, like this

From the code snippet you show, it is not at all obvious that it is done before splitting, as you claim. It depends on what exactly the train variable is here: if it is the product of a train-test split, then the oversampling takes place after splitting indeed, as it should be.

However, wouldn't that mean that the test data will likely have duplicated samples from the training set (because we have oversampled the training set)? This means that testing performance wouldn't necessarily be on new, unseen data.

Exactly, this is the reason why the oversampling should be done after splitting to train-test, and not before.

(I once witnessed a case where the modeller was struggling to understand why he was getting a ~ 100% test accuracy, much higher than his training one; turned out his initial dataset was full of duplicates -no class imbalance here, but the idea is similar- and several of these duplicates naturally ended up in his test set after the split, without of course being new or unseen data...).

I am fine doing this

You shouldn't :)

回答2:

From my experience this is a bad practice. As you mentioned test data should contain unseen samples so it would not overfit and give you better evaluation of training process. If you need to increase sample sizes - think about data transformation possibilities. E.g. human/cat image classification, as they are symmetric you can double sample size by mirroring images.

来源：https://stackoverflow.com/questions/51064462/process-for-oversampling-data-for-imbalanced-binary-classification

标签

machine-learning

scikit-learn

classification

train-test-split

imbalanced-data