Is it possible to use XGBoost for multi-label classification? Right now I use OneVsRestClassifier over GradientBoostingClassifier from sklearn.
There are a couple of ways to do that, one of which is the one you already suggested:
1.
from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
# If you want to avoid the OneVsRestClassifier magic switch
# from sklearn.multioutput import MultiOutputClassifier
clf_multilabel = OneVsRestClassifier(XGBClassifier(**params))
clf_multilabel will fit one binary classifier per class, and it will use however many cores you specify in params (fyi, you can also specify n_jobs in OneVsRestClassifier, but that eats up more memory).
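For completeness, a minimal end-to-end sketch of this first option; the toy dataset and the n_estimators value are placeholders, not anything prescribed above:

from xgboost import XGBClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.datasets import make_multilabel_classification

# toy multi-label data (placeholder sizes)
X, Y = make_multilabel_classification(n_samples=200, n_classes=5, random_state=0)

# one binary XGBoost model per label
clf_multilabel = OneVsRestClassifier(XGBClassifier(n_estimators=50))
clf_multilabel.fit(X, Y)

Y_pred = clf_multilabel.predict(X)         # (n_samples, n_classes) 0/1 indicator matrix
Y_proba = clf_multilabel.predict_proba(X)  # marginal probability per label, same shape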
2.
If you first massage your data a little by making k copies of every data point that has k correct labels, you can hack your way to a simpler multiclass problem (a sketch of this duplication step follows after the note below). At that point, just
clf = XGBClassifier(**params)
clf.fit(train_data, train_labels)  # train_labels: one class label per duplicated row
pred_proba = clf.predict_proba(test_data)
to get classification margins/probabilities for each class and decide what threshold you want for predicting a label.
Note that this solution is not exact: if a product has tags (1, 2, 3), you artificially introduce two negative samples for each class.
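A sketch of that duplication step, assuming the tags arrive as one list per sample (expand_multilabel, raw_train_data and raw_train_tags are hypothetical names, not a fixed API):

import numpy as np

def expand_multilabel(X, label_lists):
    # make k copies of each data point that has k correct labels
    rows, ys = [], []
    for x, tags in zip(X, label_lists):
        for tag in tags:
            rows.append(x)
            ys.append(tag)
    # note: XGBClassifier expects class labels encoded as 0..K-1
    return np.array(rows), np.array(ys)

train_data, train_labels = expand_multilabel(raw_train_data, raw_train_tags)

The resulting train_data and train_labels plug straight into the fit call above.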
One possible approach, instead of using OneVsRestClassifier, which is for multi-class tasks, is to use MultiOutputClassifier from the sklearn.multioutput module.
Below is a small reproducible code sample with the number of input features and target outputs requested by the OP:
import xgboost as xgb
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score
# create sample dataset
X, y = make_multilabel_classification(n_samples=3000, n_features=45, n_classes=20,
                                      n_labels=1, allow_unlabeled=False, random_state=42)
# split dataset into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# create XGBoost instance with default hyper-parameters
xgb_estimator = xgb.XGBClassifier(objective='binary:logistic')
# create MultiOutputClassifier instance with XGBoost model inside
multilabel_model = MultiOutputClassifier(xgb_estimator)
# fit the model
multilabel_model.fit(X_train, y_train)
# evaluate on test data (accuracy_score on a multilabel indicator matrix is subset accuracy:
# a prediction only counts as correct if all 20 labels match)
print('Accuracy on test data: {:.1f}%'.format(accuracy_score(y_test, multilabel_model.predict(X_test))*100))
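If you need per-label probabilities rather than hard 0/1 predictions, MultiOutputClassifier.predict_proba returns one array per target; a small follow-up sketch:

import numpy as np

# one (n_samples, 2) array per label; column 1 is P(label == 1)
proba_list = multilabel_model.predict_proba(X_test)
label_proba = np.column_stack([p[:, 1] for p in proba_list])  # (n_samples, n_classes)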
You can add a label to each class you want to predict. For example, if this is your data:
X1 X2 X3 X4 Y1 Y2 Y3
 1  3  4  6  7  8  9
 2  5  5  5  5  3  2
You can simply reshape your data by adding a label to the input, according to the output, and xgboost
should learn how to treat it accordingly, like so:
X1 X2 X3 X4 X_label Y
 1  3  4  6       1 7
 2  5  5  5       1 5
 1  3  4  6       2 8
 2  5  5  5       2 3
 1  3  4  6       3 9
 2  5  5  5       3 2
This way you will have a 1-dimensional Y, but you can still predict many labels.
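If it helps, a minimal sketch of that reshape with pandas.melt, using the toy values from the tables above:

import pandas as pd

# the toy data from the first table
df = pd.DataFrame({'X1': [1, 2], 'X2': [3, 5], 'X3': [4, 5], 'X4': [6, 5],
                   'Y1': [7, 5], 'Y2': [8, 3], 'Y3': [9, 2]})

# melt the Y columns into (X_label, Y) pairs: one row per original label
long = df.melt(id_vars=['X1', 'X2', 'X3', 'X4'],
               value_vars=['Y1', 'Y2', 'Y3'],
               var_name='X_label', value_name='Y')
# encode the label name as a number (Y1 -> 1, Y2 -> 2, Y3 -> 3)
long['X_label'] = long['X_label'].str[1:].astype(int)

Row order aside, the melted frame matches the second table, with a single Y column.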