How to generate a custom cross-validation generator in scikit-learn?

前端未结

关注

 4  1816

I have an unbalanced dataset, so I have an strategy for oversampling that I only apply during training of my data. I\'d like to use classes of scikit-learn like GridSearch


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  温柔的废话        
                
              
                            
                2021-01-31 20:33
              
            
            
                                                                       
Scikit-Learn provides a workaround for this, with their Label k-fold iterator:


  LabelKFold is a variation of k-fold which ensures that the same label is not in both testing and training sets. This is necessary for example if you obtained data from different subjects and you want to avoid over-fitting (i.e., learning person specific features) by testing and training on different subjects.


To use this iterator in a case of oversampling, first, you can create a column in your dataframe (e.g. cv_label) which stores the index values of each row.

df['cv_label'] = df.index


Then, you can apply your oversampling, making sure you copy the cv_label column in the oversampling as well. This column will contain duplicate values for the oversampled data. You can create a separate series or list from these labels for handling later:

cv_labels = df['cv_label']


Be aware that you will need to remove this column from your dataframe before running your cross-validator/classifier.

After separating your data into features (not including cv_label) and labels, you create the LabelKFold iterator and run the cross validation function you need with it:

clf = svm.SVC(C=1)
lkf = LabelKFold(cv_labels, n_folds=5)
predicted = cross_validation.cross_val_predict(clf, features, labels, cv=lkf)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  一个人的身影        
                
              
                            
                2021-01-31 20:41
              
            
            
                                                                       
The cross-validation generator returns an iterable of length n_folds, each element of which is a 2-tuple of numpy 1-d arrays (train_index, test_index) containing the indices of the test and training sets for that cross-validation run.  

So for 10-fold cross-validation, your custom cross-validation generator needs to contain 10 elements, each of which contains a tuple with two elements: 


An array of the indices for the training subset for that run, covering 90% of your data
An array of the indices for the testing subset for that run, covering 10% of the data


I was working on a similar problem in which I created integer labels for the different folds of my data. My dataset is stored in a Pandas dataframe myDf which has the column cvLabel for the cross-validation labels.  I construct the custom cross-validation generator myCViterator as follows:

myCViterator = []
for i in range(nFolds):
    trainIndices = myDf[ myDf['cvLabel']!=i ].index.values.astype(int)
    testIndices =  myDf[ myDf['cvLabel']==i ].index.values.astype(int)
    myCViterator.append( (trainIndices, testIndices) )

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  我寻月下人不归        
                
              
                            
                2021-01-31 20:51
              
            
            
                                                                       
class own_custom_CrossValidator:#like those in source sklearn/model_selection/_split.py 
    def init(self):#coordinates,meter 
        pass # self.coordinates = coordinates # self.meter = meter 
    def split(self,X,y=None,groups=None):
    #for compatibility with #cross_val_predict,cross_val_score 
        for i in range(0,len(X)): yield tuple((np.array(list(range(0,len(X))))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  暖寄归人        
                
              
                            
                2021-01-31 20:52
              
            
            
                                                                       
I had a similar problem and this quick hack is working for me:

class UpsampleStratifiedKFold:
    def __init__(self, n_splits=3):
        self.n_splits = n_splits

    def split(self, X, y, groups=None):
        for rx, tx in StratifiedKFold(n_splits=self.n_splits).split(X,y):
            nix = np.where(y[rx]==0)[0]
            pix = np.where(y[rx]==1)[0]
            pixu = np.random.choice(pix, size=nix.shape[0], replace=True)
            ix = np.append(nix, pixu)
            rxm = rx[ix]
            yield rxm, tx

    def get_n_splits(self, X, y, groups=None):
        return self.n_splits


This upsamples (with replacement) the minority class for a balanced (k-1)-fold training set, but leaves kth test set unbalanced.  This appears to play well with sklearn.model_selection.GridSearchCV and other similar classes requiring a CV generator.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复