The basic task that I have at hand is:
a) Read some tab-separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to encode the values.
For me the easiest way was exporting the LabelEncoder for each column as a .pkl
file. You have to export the encoder for each column after calling its fit_transform()
function.
For example
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('training_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# exporting the fitted departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
Then, in the testing project, you can load the LabelEncoder object and apply its transform()
function directly:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# load the encoder file
import pickle
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])
What works for me is calling LabelEncoder().fit(X_train[col])
for each categorical column col, pickling these fitted objects,
and then reusing the same objects for transforming the same categorical columns
in the validation dataset. Basically you have a label encoder object for each of your categorical columns: call fit()
on the training data and pickle the object/model corresponding to each column of the training dataframe X_train.
Then, for each categorical column col of the validation set X_cv,
load the corresponding object/model and apply the transformation by accessing the transform function as transform(X_cv[col]).
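A minimal sketch of that per-column workflow (column names and data here are made up for illustration):

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical training data with two categorical columns
X_train = pd.DataFrame({"Departure": ["BOS", "JFK", "BOS"],
                        "Arrival": ["LAX", "SFO", "SFO"]})

# fit one encoder per categorical column and keep them in a dict
encoders = {}
for col in ["Departure", "Arrival"]:
    le = LabelEncoder().fit(X_train[col])
    X_train[col] = le.transform(X_train[col])
    encoders[col] = le

# one pickle holding all per-column encoders
with open("encoders.pkl", "wb") as f:
    pickle.dump(encoders, f)

# later, when scoring the validation set X_cv
with open("encoders.pkl", "rb") as f:
    encoders = pickle.load(f)

X_cv = pd.DataFrame({"Departure": ["JFK"], "Arrival": ["LAX"]})
for col in ["Departure", "Arrival"]:
    X_cv[col] = encoders[col].transform(X_cv[col])
```

Storing all encoders in one dict keeps a single file per dataset instead of one pickle per column.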
You can also extract the mapping after you have encoded the values with the "le" object:
encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]
You will get an "encoding" dictionary that maps each label to its code for later use; with pandas you can export this dictionary to a csv, for example.
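For instance, the dictionary can be written out with pandas like this (the labels and the output filename here are just examples):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["BOS", "JFK", "LAX"])  # example labels

# label -> integer code, same mapping as the loop above
encoding = {i: le.transform([i])[0] for i in le.classes_}

# one CSV row per (label, code) pair
pd.Series(encoding, name="code").rename_axis("label").to_csv("Departure_encoding.csv")
```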
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit
the LabelEncoders at test time with data that have exactly the same set of unique values.
There's a somewhat hacky way to reuse a LabelEncoder you got during training. LabelEncoder
has only one fitted attribute, namely classes_.
You can save it, and then restore it like this:
Train:
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
Test:
encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ is an object array (e.g. strings)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`
This seems more efficient than refitting it using the same data.
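A quick round trip to check that the restored encoder behaves like the fitted one (the labels here are made up):

```python
import numpy
from sklearn.preprocessing import LabelEncoder

# fit on training labels and save only classes_
encoder = LabelEncoder()
encoder.fit(["paris", "tokyo", "amsterdam"])
numpy.save("classes.npy", encoder.classes_)

# restore into a fresh, unfitted encoder
restored = LabelEncoder()
restored.classes_ = numpy.load("classes.npy", allow_pickle=True)

# transform works without ever calling fit on `restored`
codes = restored.transform(["tokyo", "paris"])
```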
Since I found no other post about nominal/categorical encoding, I'll expand on the solutions above and share mine for the OrdinalEncoder approach (which may have been what the author intended anyway).
I did the following with OrdinalEncoder (but it should work with LabelEncoder as well). Note that I am using categories_
instead of classes_,
and that np
stands for numpy.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# ------- step 1 and 2 in the file/cell where the encoding shall be exported
enc = OrdinalEncoder()
encoder_dict = dict()
for nom in nominal_columns:
    enc = enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]
np.save('FILE_NAME.npy', encoder_dict)

# ------------ step 3 and 4 in the file where encoding shall be imported
enc = OrdinalEncoder()
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()
for nom in encoder_dict:
    if nom in df.columns:
        enc.categories_ = encoder_dict[nom]
        df[[nom]] = enc.transform(df[[nom]])
# df now holds the encoded columns
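If hand-restoring categories_ feels fragile, note that a fitted OrdinalEncoder can also be pickled whole, just like the LabelEncoder earlier in this thread. A minimal sketch (made-up column and data; unpickle with the same scikit-learn version you trained with):

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# fit on the training frame (OrdinalEncoder expects 2-D input)
df = pd.DataFrame({"Departure": ["BOS", "JFK", "BOS"]})
enc = OrdinalEncoder().fit(df[["Departure"]])

# persist the entire fitted estimator
with open("ordinal_encoder.pkl", "wb") as f:
    pickle.dump(enc, f)

# later: load and transform new data directly
with open("ordinal_encoder.pkl", "rb") as f:
    enc_loaded = pickle.load(f)

codes = enc_loaded.transform(pd.DataFrame({"Departure": ["JFK", "BOS"]}))
```

This keeps all of the encoder's fitted state in one file, so nothing has to be reassigned by hand at load time.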