UserWarning: Label not :NUMBER: is present in all training examples

后端未结

关注

 2  565

I am doing multilabel classification, where I try to predict correct labels for each document and here is my code:

mlb = MultiLabelBinarizer()
X = dataframe[


                      
              相关标签:


      
      
        
          2条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  一向        
                
              
                            
                2021-01-04 09:52
              
            
            
                                                                       

  Why is this happening, is it related to warnings it prints out while training is running?


The issue is likely to be that some tags occur just in a few documents (check out this thread for details). When you split the dataset into train and test to validate your model, it may happen that some tags are missing from the training data. Let train_indices be an array with the indices of the training samples. If a particular tag (of index k) does not occur in the training sample, all the elements in the k-th column of the indicator matrix y[train_indices] are zeros. 


  How can I avoid those empty predictions?  


In the scenario described above the classifier will not be able to reliably predict the k-th tag in the test documents (more on this in the next paragraph). Therefore you cannot trust the predictions made by clf.predict and you need to implement the prediction function on your own, for example by using the decision values returned by clf.decision_function as suggested in this answer.


  So I don't know why is this label not chosen with binary decisioning? Or is binary decisioning evaluated in different way than probabilities?


In datasets containing many labels the occurrence frequency for most of them uses to be rather low. If these low values are fed to a binary classifier (i.e. a classifier that makes a 0-1 prediction) it is highly probable that the classifier would pick 0 for all tags on all documents.


  I have found an old post where OP was dealing with similar problem. Is this the same case?


Yes, absolutely. That guy is facing exactly the same problem as you and his code is pretty similar to yours.



Demo

To further explain the issue I have elaborated a simple toy example using mock data.

Q = {'What does the "yield" keyword do in Python?': ['python'],
     'What is a metaclass in Python?': ['oop'],
     'How do I check whether a file exists using Python?': ['python'],
     'How to make a chain of function decorators?': ['python', 'decorator'],
     'Using i and j as variables in Matlab': ['matlab', 'naming-conventions'],
     'MATLAB: get variable type': ['matlab'],
     'Why is MATLAB so fast in matrix multiplication?': ['performance'],
     'Is MATLAB OOP slow or am I doing something wrong?': ['matlab-oop'],
    }
dataframe = pd.DataFrame({'body': Q.keys(), 'tag': Q.values()})    

mlb = MultiLabelBinarizer()
X = dataframe['body'].values 
y = mlb.fit_transform(dataframe['tag'].values)

classifier = Pipeline([
    ('vectorizer', CountVectorizer(lowercase=True, 
                                   stop_words='english', 
                                   max_df=0.8, 
                                   min_df=1)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(LinearSVC()))])


Please, notice that I have set min_df=1 since my dataset is much smaller than yours. When I run the following sentence:

predicted = cross_val_predict(classifier, X, y)


I get a bunch of warnings

C:\...\multiclass.py:76: UserWarning: Label not 4 is present in all training examples.
  str(classes[c]))
C:\\multiclass.py:76: UserWarning: Label not 0 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 3 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 5 is present in all training examples.
  str(classes[c]))
C:\...\multiclass.py:76: UserWarning: Label not 2 is present in all training examples.
  str(classes[c]))


and the following prediction:

In [5]: np.set_printoptions(precision=2, threshold=1000)    

In [6]: predicted
Out[6]: 
array([[0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 1],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0]])


Those rows whose entries are all 0 indicate that no tag is predicted for the corresponding document.



Workaround

For the sake of the analysis, let us validate the model manually rather than through cross_val_predict. 

import warnings
from sklearn.model_selection import ShuffleSplit

rs = ShuffleSplit(n_splits=1, test_size=.5, random_state=0)
train_indices, test_indices = rs.split(X).next()

with warnings.catch_warnings(record=True) as received_warnings:
    warnings.simplefilter("always")
    X_train, y_train = X[train_indices], y[train_indices]
    X_test, y_test = X[test_indices], y[test_indices]
    classifier.fit(X_train, y_train)
    predicted_test = classifier.predict(X_test)
    for w in received_warnings:
        print w.message


When the snippet above is executed two warnings are issued (I used a context manager to make sure warnings are catched):

Label not 2 is present in all training examples.
Label not 4 is present in all training examples.


This is consistent with the fact that tags of indices 2 and 4 are missing from the training samples:

In [40]: y_train
Out[40]: 
array([[0, 0, 0, 0, 0, 1, 0],
       [0, 1, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0],
       [1, 0, 0, 0, 0, 0, 1]])


For some documents, the prediction is empty (those documents corresponding to the rows with all zeros in predicted_test):

In [42]: predicted_test
Out[42]: 
array([[0, 1, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0, 0],
       [0, 1, 0, 1, 0, 0, 0]])


To overcome that issue, you could implement your own prediction function like this:

def get_best_tags(clf, X, lb, n_tags=3):
    decfun = clf.decision_function(X)
    best_tags = np.argsort(decfun)[:, :-(n_tags+1): -1]
    return lb.classes_[best_tags]


By doing so, each document is always assigned the n_tag tags with the highest confidence score:

In [59]: mlb.inverse_transform(predicted_test)
Out[59]: [('matlab',), (), (), ('matlab', 'naming-conventions')]

In [60]: get_best_tags(classifier, X_test, mlb)
Out[60]: 
array([['matlab', 'oop', 'matlab-oop'],
       ['oop', 'matlab-oop', 'matlab'],
       ['oop', 'matlab-oop', 'matlab'],
       ['matlab', 'naming-conventions', 'oop']], dtype=object)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  不要未来只要你来        
                
              
                            
                2021-01-04 10:13
              
            
            
                                                                       
I too had the same error. Then I used LabelEncoder() instead of MultiLabelBinarizer() to encode the labels.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
y = le.fit_transform(Labels)


I am not getting that error anymore.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复