XGBoost difference in train and test features after converting to DMatrix

后端未结

关注

 3  1161

Just wondering how is possible next case:

 def fit(self, train, target):
     xgtrain = xgb.DMatrix(train, label=target, missing=np.nan)
     self.model = xg


                      
              相关标签:


      
      
        
          3条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  花落未央        
                
              
                            
                2021-01-03 05:48
              
            
            
                                                                       
This situation can happen after one-hot encoding. For example,

ar = np.array([
        [1, 2],
        [1, 0]
])

enc = OneHotEncoder().fit(ar)
ar2 = enc.transform(ar)
b = np.array([[1, 0]])
b2 = enc.transform(b)
xgb_ar = xgb.DMatrix(ar2)
xgb_b = xgb.DMatrix(b2)

print(b2.shape) # (1, 3)
print(xgb_b.num_col()) # 2


So, when you have all zero column in sparse matrix, DMatrix drop this column (I think, because this column is useless for XGBoost)

Usually, I add a fake row to matrix which contents 1 in all columns.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  星月不相逢        
                
              
                            
                2021-01-03 05:57
              
            
            
                                                                       
Such an issue occurred for me when RandomUnderSampler (RUS) method returned a np.array rather than a Pandas DataFrame with column names.

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X_train, y_train)


I resolved the issue with this:

X_rus = pd.DataFrame(X_rus, columns = X_train.columns)


Basically taking the output of RUS method and creating a Pandas DataFrame out of it with column names from the original X_train data which was the input of RUS method.

This can be generalized to any similar problem where XGBoost expected to read column names but could not. Just create a Pandas DataFrame and assign the column names accordingly.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  日久生厌        
                
              
                            
                2021-01-03 05:59
              
            
            
                                                                       
One another possibility is to have one feature level exclusively in training data not in testing data. This situation happens mostly while post one hot encoding whose resultant is big matrix have level for each level of categorical features. In your case it looks like "f5232" is either exclusive in training or test data. If either case model scoring likely to throw error (in most implementations of ML packages) because:


If exclusive to training: Model object will have reference of this feature in model equation. While scoring it will throw error saying I am not able to find this column.
If exclusive to test (lesser likely as test data is usually smaller than training data): Model object will NOT have reference of this feature in model equation. While scoring it will throw error saying I got this column but model equation don't have this column. This is also lesser likely because most implementations are cognizant of this case.


Solutions:


The best "automated" solution is to keep only those columns, which are common to both training and test post one hot encoding. 
For adhoc analysis if you can not afford to drop the level of feature because of its importance then do stratified sampling to ensure that all level of feature gets distributed to training and test data.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复