XGBoost for multilabel classification?

一整个雨季 2020-12-30 03:10

Is it possible to use XGBoost for multi-label classification? At the moment I use OneVsRestClassifier wrapped around GradientBoostingClassifier from sklearn.

3 answers
  • 2020-12-30 03:31

    There are a couple of ways to do that, one of which is the one you already suggested:

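    1. Wrap XGBClassifier in OneVsRestClassifier:

    from xgboost import XGBClassifier
    from sklearn.multiclass import OneVsRestClassifier
    # If you want to avoid the OneVsRestClassifier magic switch
    # (it infers multiclass vs. multilabel from the shape of y), use:
    # from sklearn.multioutput import MultiOutputClassifier
    
    # Example value only; substitute your own XGBoost hyper-parameters
    params = {'n_jobs': 4}
    clf_multilabel = OneVsRestClassifier(XGBClassifier(**params))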
    

    clf_multilabel will fit one binary classifier per label, and each of them will use however many cores you specify in params. (FYI, you can also set n_jobs on OneVsRestClassifier to fit the per-label models in parallel, but that eats up more memory.)
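
To make the difference between the two wrappers concrete, here is a minimal sketch. It assumes toy data invented for illustration and uses sklearn's GradientBoostingClassifier (the estimator mentioned in the question) as a stand-in, so it runs without xgboost; XGBClassifier(**params) should be able to take its place.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Toy multilabel data (made up for illustration): 3 labels, each a
# simple function of one input feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = (X[:, :3] > 0).astype(int)  # label-indicator matrix, shape (200, 3)

# Stand-in base estimator; XGBClassifier(**params) would go here instead.
base = GradientBoostingClassifier(n_estimators=20, random_state=0)

ovr = OneVsRestClassifier(base).fit(X, Y)
mo = MultiOutputClassifier(base).fit(X, Y)

# OneVsRestClassifier saw an indicator matrix and switched to multilabel
# mode; predict_proba has one column per label.
proba_ovr = ovr.predict_proba(X)

# MultiOutputClassifier returns a list with one (n_samples, 2) array per label.
proba_mo = mo.predict_proba(X)

print(ovr.multilabel_, proba_ovr.shape, len(proba_mo), proba_mo[0].shape)
```

Both wrappers fit one cloned binary model per label column; the main practical difference shown here is the shape of what predict_proba returns.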

    2. If you first massage your data a little by making k copies of every data point that has k correct labels (one copy per label), you can hack your way to a simpler multiclass problem. At that point, just

    clf = XGBClassifier(**params)
    # train_labels: one class id per duplicated row of train_data
    clf.fit(train_data, train_labels)
    pred_proba = clf.predict_proba(test_data)
    

    to get classification margins/probabilities for each class and decide what threshold you want for predicting a label. Note that this solution is not exact: if a product has tags (1, 2, 3), each of its three copies carries only one of those tags, so you artificially introduce two negative samples for each class.
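
The duplication step can be sketched with NumPy alone. The arrays, the probabilities, and the threshold value below are all invented for illustration; only the idea of one copy per positive label comes from the answer.

```python
import numpy as np

# Hypothetical toy data: 3 samples, 2 features, 4 possible labels.
X = np.array([[0.1, 1.0],
              [0.5, 2.0],
              [0.9, 3.0]])
Y = np.array([[1, 0, 1, 0],   # sample 0 has labels 0 and 2
              [0, 1, 0, 0],   # sample 1 has label 1
              [1, 1, 0, 1]])  # sample 2 has labels 0, 1 and 3

# One (row, label) pair per positive entry: a sample with k labels
# contributes k rows.
rows, labels = np.nonzero(Y)
train_data = X[rows]      # duplicated feature rows, shape (6, 2)
train_labels = labels     # one class id per duplicated row

# After fitting a multiclass model on (train_data, train_labels),
# predict_proba gives one row of class probabilities per test sample.
# The values below are invented to show the thresholding step.
pred_proba = np.array([[0.45, 0.05, 0.40, 0.10]])
threshold = 0.3           # arbitrary; tune on validation data
pred_labels = (pred_proba >= threshold).astype(int)
```

Because multiclass probabilities sum to 1 across classes, a sample with several true labels splits its probability mass among them, so a useful threshold will usually sit well below 0.5.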

  • 2020-12-30 03:31

    One possible approach is to use MultiOutputClassifier from the sklearn.multioutput module instead of OneVsRestClassifier. OneVsRestClassifier is aimed mainly at multi-class tasks, although it also handles multilabel targets when y is a label-indicator matrix.

    Below is a small reproducible example with the number of input features and target outputs requested by the OP.

    import xgboost as xgb
    from sklearn.datasets import make_multilabel_classification
    from sklearn.model_selection import train_test_split
    from sklearn.multioutput import MultiOutputClassifier
    from sklearn.metrics import accuracy_score
    
    # create sample dataset
    X, y = make_multilabel_classification(n_samples=3000, n_features=45, n_classes=20, n_labels=1,
                                          allow_unlabeled=False, random_state=42)
    
    # split dataset into training and test set
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
    
    # create XGBoost instance with default hyper-parameters
    xgb_estimator = xgb.XGBClassifier(objective='binary:logistic')
    
    # create MultiOutputClassifier instance with XGBoost model inside
    multilabel_model = MultiOutputClassifier(xgb_estimator)
    
    # fit the model
    multilabel_model.fit(X_train, y_train)
    
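    # evaluate on test data; for multilabel targets accuracy_score is subset
    # accuracy (a sample counts as correct only if every label matches)
    y_pred = multilabel_model.predict(X_test)
    print('Accuracy on test data: {:.1f}%'.format(accuracy_score(y_test, y_pred) * 100))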
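
For multilabel targets, accuracy_score reports subset accuracy, which is strict. As a rough illustration of how that differs from per-label views, here is a NumPy-only sketch on invented predictions (the arrays are not output of the model above):

```python
import numpy as np

# Invented predictions for 4 samples and 3 labels.
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_pred = np.array([[1, 0, 1],   # all labels right
                   [0, 1, 1],   # one label wrong
                   [1, 1, 0],   # all labels right
                   [1, 0, 1]])  # one label wrong

# Subset accuracy: a sample is correct only if every label matches.
subset_accuracy = (y_true == y_pred).all(axis=1).mean()

# Per-label accuracy: fraction of correct predictions in each column.
per_label_accuracy = (y_true == y_pred).mean(axis=0)

# Hamming loss: fraction of individual label decisions that are wrong.
hamming = (y_true != y_pred).mean()

print(subset_accuracy, per_label_accuracy, hamming)
```

With 20 labels, as in the sample above, a model can get most individual labels right and still score low on subset accuracy, so it may be worth looking at both.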
    
  • 2020-12-30 03:35

    You can add an extra input column that says which output you want to predict. For example, if this is your data:

    X1 X2 X3 X4  Y1 Y2 Y3
     1  3  4  6   7  8  9
     2  5  5  5   5  3  2
    

    You can reshape your data into one row per (sample, output) pair, adding the output's index to the input as a feature. XGBoost should then learn to treat each index value accordingly, like so:

    X1 X2 X3 X4 X_label Y
     1  3  4  6   1     7
     2  5  5  5   1     5
     1  3  4  6   2     8
     2  5  5  5   2     3
     1  3  4  6   3     9
     2  5  5  5   3     2
    

    This way you will have a 1-dimensional Y, but you can still predict many labels.
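
The reshape above can be sketched in NumPy as follows. The variable names (X_long, y_long, x_label, X_query) are made up for this example; the numbers are the ones from the tables.

```python
import numpy as np

X = np.array([[1, 3, 4, 6],
              [2, 5, 5, 5]])
Y = np.array([[7, 8, 9],
              [5, 3, 2]])
n_samples, n_outputs = Y.shape

# Repeat the feature block once per output and tag each copy with the
# 1-based index of the output it belongs to.
x_label = np.repeat(np.arange(1, n_outputs + 1), n_samples)
X_long = np.column_stack([np.tile(X, (n_outputs, 1)), x_label])

# Stack the target columns in the same order: all of Y1, then Y2, then Y3.
y_long = Y.T.ravel()

# To predict every output for a new sample, score it once per x_label value.
x_new = np.array([1, 3, 4, 6])
X_query = np.column_stack([np.tile(x_new, (n_outputs, 1)),
                           np.arange(1, n_outputs + 1)])

print(X_long)
print(y_long)
```

A single model trained on (X_long, y_long) is then queried with X_query, giving one prediction per output for the new sample.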
