Imbalance in scikit-learn

北恋 2021-01-31 08:35

I'm using scikit-learn in my Python program in order to perform some machine-learning operations. The problem is that my dataset has severe imbalance issues.

Is anyone familiar with a solution for handling imbalance in scikit-learn, or in Python in general?

5 Answers
  • 2021-01-31 08:47

    SMOTE is not built into scikit-learn, but there are implementations available online nevertheless.

    Edit: The discussion with a SMOTE implementation on GMane that I originally linked to appears to no longer be available. The code is preserved here.

    The newer answer below, by @nos, is also quite good.

  • 2021-01-31 08:54

    There is a newer one here:

    https://github.com/scikit-learn-contrib/imbalanced-learn

    It contains many algorithms in the following categories, including SMOTE (a quick usage sketch follows the list):

    • Under-sampling the majority class(es).
    • Over-sampling the minority class.
    • Combining over- and under-sampling.
    • Creating ensemble balanced sets.
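
    As a quick sketch of those categories on synthetic data (the resamplers shown are real imbalanced-learn classes; the dataset itself is made up purely for illustration):

    from collections import Counter
    from sklearn.datasets import make_classification
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn.over_sampling import SMOTE
    from imblearn.combine import SMOTEENN

    # Synthetic two-class data with a 9:1 imbalance (illustrative only).
    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
    print('Original:    ', Counter(y))

    # Under-sampling the majority class.
    X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
    print('Undersampled:', Counter(y_u))

    # Over-sampling the minority class with SMOTE.
    X_o, y_o = SMOTE(random_state=0).fit_resample(X, y)
    print('SMOTE:       ', Counter(y_o))

    # Combining over- and under-sampling (SMOTE + Edited Nearest Neighbours).
    X_c, y_c = SMOTEENN(random_state=0).fit_resample(X, y)
    print('SMOTEENN:    ', Counter(y_c))

    The fourth category (ensemble balanced sets) is covered by estimators such as imblearn.ensemble.BalancedBaggingClassifier.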
  • 2021-01-31 08:58

    In scikit-learn there are some imbalance-correction techniques, which vary according to the learning algorithm you are using.

    Some of them, like SVM or logistic regression, have a class_weight parameter. If you instantiate an SVC with this parameter set to 'auto' (renamed 'balanced' in current scikit-learn versions), it will weight each class's examples inversely proportionally to their frequency.
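
    A minimal sketch of what that looks like (on made-up data; the explicit weight mapping at the end is just an illustration):

    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    # Made-up imbalanced data (90% / 10%), purely to illustrate the parameter.
    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    # 'balanced' weights each class by n_samples / (n_classes * class_count),
    # i.e. inversely proportional to its frequency.
    clf = SVC(class_weight='balanced').fit(X, y)

    # An explicit mapping also works, e.g. make the minority class 5x heavier.
    clf_manual = SVC(class_weight={0: 1, 1: 5}).fit(X, y)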

    Unfortunately, there isn't a preprocessing tool for this purpose.

  • 2021-01-31 09:00

    Since others have listed links to the very popular imbalanced-learn library, I'll give an overview of how to use it properly, along with some links.

    https://imbalanced-learn.org/en/stable/generated/imblearn.under_sampling.RandomUnderSampler.html

    https://imbalanced-learn.org/en/stable/generated/imblearn.over_sampling.RandomOverSampler.html

    https://imbalanced-learn.readthedocs.io/en/stable/generated/imblearn.over_sampling.SMOTE.html

    https://imbalanced-learn.readthedocs.io/en/stable/auto_examples/over-sampling/plot_comparison_over_sampling.html#sphx-glr-auto-examples-over-sampling-plot-comparison-over-sampling-py

    https://imbalanced-learn.org/en/stable/combine.html

    Some common over-sampling and under-sampling techniques in imbalanced-learn are imblearn.over_sampling.RandomOverSampler, imblearn.under_sampling.RandomUnderSampler, and imblearn.over_sampling.SMOTE. These classes have a convenient parameter (sampling_strategy, called ratio in older releases) that allows the user to change the sampling ratio.

    For example, in SMOTE, to change the ratio you pass a dictionary mapping each class to its desired sample count, and each count must be greater than or equal to that class's current size (since SMOTE is an over-sampling technique and can only add samples). The reason I have found SMOTE to be a better fit for model performance in my experience is probably that with RandomOverSampler you are duplicating rows, which means the model can start to memorize the data rather than generalize to new data. SMOTE uses the k-nearest-neighbors algorithm to synthesize data points "similar" to the under-represented ones.

    It is not good practice to blindly use SMOTE with its default ratio (even class balance), because the model may overfit one or more of the minority classes (even though SMOTE uses nearest neighbors to make "similar" observations). In the same way that you tune the hyperparameters of an ML model, you tune the hyperparameters of the SMOTE algorithm, such as the ratio and/or k_neighbors (the knn setting). A sketch of such tuning follows, and after the note below is a working example of how to properly use SMOTE.
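
    For instance, a hedged sketch of tuning SMOTE inside a cross-validated grid search (the grid values and dataset are illustrative; imbalanced-learn's Pipeline resamples only the training folds, so the validation folds stay untouched):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import GridSearchCV
    from imblearn.pipeline import Pipeline
    from imblearn.over_sampling import SMOTE
    from xgboost import XGBClassifier

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

    # Resampling happens inside each CV fold, never on the held-out fold.
    pipe = Pipeline([('smote', SMOTE(random_state=0)),
                     ('clf', XGBClassifier(n_jobs=8))])

    param_grid = {'smote__k_neighbors': [3, 5, 7],         # the "knn" setting
                  'smote__sampling_strategy': [0.5, 1.0]}  # minority/majority ratio

    search = GridSearchCV(pipe, param_grid, scoring='f1_macro', cv=5).fit(X, y)
    print(search.best_params_, search.best_score_)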

    NOTE: It is vital that you do not fit SMOTE on the full data set. You MUST apply SMOTE to the training set only (after you split). Then validate on your val/test sets and see whether your SMOTE model outperformed your other model(s). If you do not do this, there will be data leakage and your model will essentially be cheating.

    from collections import Counter
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score
    from imblearn.over_sampling import SMOTE
    import numpy as np
    from xgboost import XGBClassifier
    import warnings

    warnings.filterwarnings(action='ignore', category=DeprecationWarning)

    # X_normalized and y are assumed to be your (already scaled) features and labels.
    # Split FIRST, so SMOTE never sees the test set (this is what avoids leakage).
    X_train, X_test, y_train, y_test = train_test_split(
        X_normalized, y, stratify=y, random_state=0)

    # Desired sample count per class ('class1'..'class5' are placeholder labels);
    # each target must be >= that class's original count, since SMOTE only adds samples.
    # In older imbalanced-learn versions this parameter was called `ratio`.
    sm = SMOTE(random_state=0,
               sampling_strategy={'class1': 100, 'class2': 100, 'class3': 80,
                                  'class4': 60, 'class5': 90})
    X_train_smote, y_train_smote = sm.fit_resample(X_train, y_train)

    print('Original training set shape:', Counter(y_train))
    print('Resampled training set shape:', Counter(y_train_smote))

    smote_xgbc = XGBClassifier(n_jobs=8).fit(X_train_smote, y_train_smote)

    # Evaluate on the untouched (non-resampled) train and test splits.
    print('TRAIN')
    print(accuracy_score(y_train, smote_xgbc.predict(np.array(X_train))))
    print(f1_score(y_train, smote_xgbc.predict(np.array(X_train)), average='macro'))

    print('TEST')
    print(accuracy_score(y_test, smote_xgbc.predict(np.array(X_test))))
    print(f1_score(y_test, smote_xgbc.predict(np.array(X_test)), average='macro'))
    
  • 2021-01-31 09:07

    I found one other library here which implements undersampling as well as multiple oversampling techniques, including several SMOTE implementations and one that uses SVM:

    A Python Package to Tackle the Curse of Imbalanced Datasets in Machine Learning
