Imbalance in scikit-learn

前端未结


I'm using scikit-learn in my Python program to perform some machine-learning operations. The problem is that my dataset is severely imbalanced.

Is anyone

5 Answers

孤城傲影

2021-01-31 08:47

SMOTE is not built into scikit-learn, but implementations are available online.

Edit: The GMane discussion with a SMOTE implementation that I originally linked to appears to be no longer available. The code is preserved here.

The newer answer below, by @nos, is also quite good.

梦谈多话

2021-01-31 08:54
There is a newer library here:

https://github.com/scikit-learn-contrib/imbalanced-learn

It contains many algorithms, including SMOTE, in the following categories:
- Under-sampling the majority class(es).
- Over-sampling the minority class.
- Combining over- and under-sampling.
- Creating balanced ensemble sets.
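
To make the first three categories concrete, here is a minimal sketch in plain Python. It is not imbalanced-learn's code: the function `random_resample` and the toy data are made up for illustration, and it only does random duplication and random dropping of rows.

```python
import random
from collections import Counter, defaultdict


def random_resample(X, y, target_count, seed=0):
    """Resample every class to exactly `target_count` rows (hypothetical helper).

    Classes with at least `target_count` rows are under-sampled (drawn
    without replacement); smaller classes are over-sampled by duplicating
    randomly chosen existing rows.
    """
    rng = random.Random(seed)
    rows_by_class = defaultdict(list)
    for features, label in zip(X, y):
        rows_by_class[label].append(features)

    X_out, y_out = [], []
    for label, rows in rows_by_class.items():
        if len(rows) >= target_count:
            chosen = rng.sample(rows, target_count)  # under-sampling
        else:
            extra = rng.choices(rows, k=target_count - len(rows))
            chosen = rows + extra  # over-sampling by duplication
        X_out.extend(chosen)
        y_out.extend([label] * target_count)
    return X_out, y_out


# Toy data: 8 rows of class "a" and 2 rows of class "b".
X = [[i, i * 2] for i in range(10)]
y = ["a"] * 8 + ["b"] * 2

X_bal, y_bal = random_resample(X, y, target_count=5)
print(Counter(y), "->", Counter(y_bal))
```

Calling it with a `target_count` between the two class sizes combines both directions: the majority class shrinks while the minority class grows.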
鱼传尺愫

2021-01-31 08:58

scikit-learn offers some imbalance-correction techniques, which vary according to the learning algorithm you are using.

Some of them, like SVM or logistic regression, have the class_weight parameter. If you instantiate an SVC with this parameter set to 'balanced' (called 'auto' in older scikit-learn releases), it weights the examples of each class in inverse proportion to that class's frequency.

Unfortunately, there isn't a preprocessing tool for this purpose.
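
To show what "inverse of its frequency" means in numbers, here is a small sketch that computes the weights by hand. It assumes the formula given in the scikit-learn documentation for `class_weight='balanced'`, namely `n_samples / (n_classes * count_of_class)`. The helper name `balanced_class_weights` is made up for this example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Return {label: n_samples / (n_classes * count(label))} (hypothetical helper)."""
    counts = Counter(y)
    n_samples = len(y)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy labels: 90 examples of class 0 and 10 examples of class 1.
y = [0] * 90 + [1] * 10
weights = balanced_class_weights(y)
print(weights)  # the rare class gets the larger weight
```

A dictionary like this can be passed as `class_weight=weights` to estimators such as `SVC` or `LogisticRegression`. Passing the string `'balanced'` should give the same weighting without the manual step.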

Happy的楠姐

2021-01-31 09:00
Since others have already linked to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.

https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html

https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html

https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py

https://imbalanced-learn.org/en/stable/combine.html

Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes share a useful parameter that lets the user change the sampling ratio: sampling_strategy (called ratio in imbalanced-learn versions before 0.4).

For example, in SMOTE you change the ratio by passing a dictionary that maps each class label to the number of samples you want after resampling. Because SMOTE is an over-sampling technique, each value must be greater than or equal to that class's current sample count. In my experience SMOTE has been a better fit for model performance, probably because RandomOverSampler duplicates rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE instead uses the k-nearest-neighbors algorithm to create new data points that are "similar" to those of the under-represented classes.
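
The nearest-neighbor idea can be sketched in a few lines of plain Python. This is a simplified illustration of the interpolation step, not imbalanced-learn's implementation. The function name `smote_sketch`, its parameters, and the toy points are assumptions made for the example (`math.dist` needs Python 3.8 or later).

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate `n_new` synthetic rows from a list of minority-class rows.

    For each synthetic row: pick a random minority row, pick one of its k
    nearest minority neighbours, and interpolate at a random position on
    the line segment between the two.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority rows.
        others = [row for row in minority if row is not base]
        neighbours = sorted(others, key=lambda row: math.dist(base, row))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # random position in [0, 1) along the segment
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with four 2-D points.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=3)
for point in new_points:
    print([round(v, 3) for v in point])
```

Each synthetic row lies between two real minority rows instead of being an exact copy of one, which is the difference from random over-sampling described above.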

It is not good practice to blindly use SMOTE with the ratio left at its default (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). Just as you tune the hyperparameters of an ML model, you should tune the hyperparameters of the SMOTE algorithm, such as the sampling ratio and/or the number of nearest neighbors (k_neighbors). Below is a working example of how to properly use SMOTE.

NOTE: It is vital that you do not use SMOTE on the full dataset. You MUST use SMOTE on the training set only (after you split). Then validate on your val/test sets and see if your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model is essentially cheating.
```
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# X_normalized and y are your own feature matrix and label vector.
# Split FIRST so that SMOTE never sees the test rows (avoids data leakage).
X_train, X_test, y_train, y_test = train_test_split(
    X_normalized, y, stratify=y, random_state=0
)

# Target sample count per class after resampling; each value must be at
# least that class's count in y_train. In imbalanced-learn < 0.4 this
# argument was called `ratio` and the method was `fit_sample`.
sm = SMOTE(
    random_state=0,
    sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80, 'class4': 60, 'class5': 90},
)
X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

print('Original training set shape:', Counter(y_train))
print('Resampled training set shape:', Counter(y_train_smote))

# Train on the resampled training data only.
# Depending on your xgboost version, labels may need to be integer-encoded first.
smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

# Evaluate on the original (non-resampled) splits; metrics take (y_true, y_pred).
# Multiclass F1 needs an explicit averaging mode.
print('TRAIN')
print(accuracy_score(y_train, smote_xgbc.predict(X_train)))
print(f1_score(y_train, smote_xgbc.predict(X_train), average='macro'))

print('TEST')
print(accuracy_score(y_test, smote_xgbc.predict(X_test)))
print(f1_score(y_test, smote_xgbc.predict(X_test), average='macro'))
```
长情又很酷

2021-01-31 09:07

I found another library here that implements under-sampling as well as several over-sampling techniques, including multiple SMOTE variants and one that uses SVM:

A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
