I\'m using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my data-set has severe imbalance issues.
Is anyone
SMOTE is not a builtin in scikit-learn, but there are implementations available online nevertheless.
Edit: The discussion with a SMOTE implementation on GMane that I originally linked to, appears to be no longer available. The code is preserved here.
The newer answer below, by @nos, is also quite good.
There is a new one here
https://github.com/scikit-learn-contrib/imbalanced-learn
It contains many algorithms in the following categories, including SMOTE
In Scikit learn there are some imbalance correction techniques, which vary according with which learning algorithm are you using.
Some one of them, like Svm or logistic regression, have the class_weight
parameter. If you instantiate an SVC
with this parameter set on 'auto'
, it will weight each class example proportionally to the inverse of its frequency.
Unfortunately, there isn't a preprocessor tool with this purpose.
Since others have listed links to the very popular imbalanced-learn library I'll give an overview about how to properly use it along with some links.
https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html
https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html
https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html
https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py
https://imbalanced-learn.org/en/stable/combine.html
Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.SMOTE. For these libraries there is a nice parameter that allows the user to change the sampling ratio.
For example, in SMOTE, to change the ratio you would input a dictionary, and all values must be greater than or equal to the largest class (since SMOTE is an over-sampling technique). The reason I have found SMOTE to be a better fit for model performance in my experience is probably because with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the K-Nearest-Neighbors algorithm to make "similar" data points to those under sampled ones.
It is not good practice to blindly use SMOTE, setting the ratio to it's default (even class balance) because the model may overfit one or more of the minority classes (even though SMOTE is using nearest neighbors to make "similar" observations). In a similar way that you tune hyperparameters of a ML model you will tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or knn. Below is a working example of how to properly use SMOTE.
NOTE: It is vital that you do not use SMOTE on the full data set. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model out performed your other model(s). If you do not do this there will be data leakage and your model is essentially cheating.
from collections import Counter
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import numpy as np
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings(action='ignore', category=DeprecationWarning)
sm = SMOTE(random_state=0, n_jobs=8, ratio={'class1':100, 'class2':100, 'class3':80, 'class4':60, 'class5':90})
X_resampled, y_resampled = sm.fit_sample(X_normalized, y)
print('Original dataset shape:', Counter(y))
print('Resampled dataset shape:', Counter(y_resampled))
X_train_smote, X_test_smote, y_train_smote, y_test_smote = train_test_split(X_resampled, y_resampled)
X_train_smote.shape, X_test_smote.shape, y_train_smote.shape, y_test_smote.shape, X_resampled.shape, y_resampled.shape
smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)
print('TRAIN')
print(accuracy_score(smote_xgbc.predict(np.array(X_train_normalized)), y_train))
print(f1_score(smote_xgbc.predict(np.array(X_train_normalized)), y_train))
print('TEST')
print(accuracy_score(smote_xgbc.predict(np.array(X_test_normalized)), y_test))
print(f1_score(smote_xgbc.predict(np.array(X_test_normalized)), y_test))
I found one other library here which implements undersampling and also multiple oversampling techniques including multiple SMOTE
implementations and another which uses SVM
:
A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning