Using Scikit's LabelEncoder correctly across multiple programs

无人共我 2020-12-02 17:32

The basic task that I have at hand is

a) Read some tab-separated data.

b) Do some basic preprocessing.

c) For each categorical column use LabelE

5 Answers
  • 2020-12-02 17:43

    For me, the easiest way was to export the LabelEncoder as a .pkl file, one per column. You have to export each column's encoder after calling fit_transform() on that column.

    For example:

    import pickle

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder

    df_train = pd.read_csv('training_data.csv')
    le = LabelEncoder()
    df_train['Departure'] = le.fit_transform(df_train['Departure'])

    # Export the fitted Departure encoder
    with open('Departure_encoder.pkl', 'wb') as output:
        pickle.dump(le, output)
    

    Then, in the testing project, you can load the LabelEncoder object and call its transform() function directly:

    import pickle

    import pandas as pd

    df_test = pd.read_csv('testing_data.csv')

    # Load the encoder saved by the training program
    with open('Departure_encoder.pkl', 'rb') as pkl_file:
        le_departure = pickle.load(pkl_file)

    df_test['Departure'] = le_departure.transform(df_test['Departure'])
    
  • 2020-12-02 17:43

    What works for me is calling LabelEncoder().fit(X_train[col]) for each categorical column col, pickling the fitted objects, and then reusing them to transform the same column in the validation dataset. In other words, you keep one label encoder object per categorical column.

    1. Call fit() on the training data and pickle the fitted encoder for each column of the training dataframe X_train.
    2. For each col in the columns of the validation set X_cv, load the corresponding encoder and apply it with transform(X_cv[col]).
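The two steps above can be sketched roughly as follows. The frames, column names, and file name here are made up for illustration, and all encoders are stored in a single dictionary in one pickle file rather than one file per column, which is just one possible layout.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and validation frames with two categorical columns.
X_train = pd.DataFrame({
    "Departure": ["LHR", "JFK", "CDG", "JFK"],
    "Carrier": ["BA", "AA", "AF", "AA"],
})
X_cv = pd.DataFrame({
    "Departure": ["CDG", "LHR"],
    "Carrier": ["AF", "BA"],
})
categorical_columns = ["Departure", "Carrier"]

# Step 1: fit one encoder per column on the training data and pickle them.
encoders = {col: LabelEncoder().fit(X_train[col]) for col in categorical_columns}
encoder_path = Path(tempfile.gettempdir()) / "label_encoders.pkl"  # assumed location
with open(encoder_path, "wb") as f:
    pickle.dump(encoders, f)

# Step 2: in the other program, load the encoders and transform each column.
with open(encoder_path, "rb") as f:
    loaded_encoders = pickle.load(f)

X_cv_encoded = X_cv.copy()
for col, le in loaded_encoders.items():
    X_cv_encoded[col] = le.transform(X_cv[col])
```

Note that transform() raises a ValueError if the validation data contains a label that was not seen during fit.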
  • 2020-12-02 17:57

    Once the "le" object has been fitted, you can build a dictionary that maps each label to its code:

    # Map each original label to its integer code
    encoding = {}
    for i in list(le.classes_):
        encoding[i] = le.transform([i])[0]
    

    This gives you the "encoding" dictionary, which holds the encoding for later use. With pandas you can, for example, export this dictionary to a CSV file.
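A minimal sketch of that CSV round trip, assuming a small made-up set of labels and the column names "label" and "code" (neither comes from the answer). An in-memory buffer stands in for a real file so the snippet is self-contained.

```python
import io

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; in practice this would be a training column.
le = LabelEncoder().fit(["paris", "tokyo", "amsterdam", "tokyo"])

# Codes are the positions of the labels in le.classes_.
encoding = {str(label): int(code) for code, label in enumerate(le.classes_)}

# Export the mapping as a two-column CSV.
mapping = pd.DataFrame({"label": list(encoding.keys()), "code": list(encoding.values())})
buffer = io.StringIO()
mapping.to_csv(buffer, index=False)

# Later (or in another program): read the CSV back and apply the mapping.
buffer.seek(0)
restored = pd.read_csv(buffer)
restored_encoding = dict(zip(restored["label"], restored["code"]))
new_codes = pd.Series(["tokyo", "paris"]).map(restored_encoding)
```

With Series.map, labels that are missing from the dictionary become NaN instead of raising an error, so it is worth checking the result for missing values.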

  • 2020-12-02 18:00

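    According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if the data you fit the LabelEncoders on at test time has exactly the same set of unique values as the training data. The codes are simply positions in the sorted array of unique values, so a missing or extra value shifts them.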
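A small illustration of why refitting is risky, using made-up values: the same label receives a different code as soon as the set of unique values differs between the two fits.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values; "bus" never appears at test time.
train_values = ["bus", "car", "train"]
test_values = ["car", "train"]

train_encoder = LabelEncoder().fit(train_values)
refit_encoder = LabelEncoder().fit(test_values)

# The code for "car" depends on which data the encoder was fitted on.
code_from_train = int(train_encoder.transform(["car"])[0])
code_from_refit = int(refit_encoder.transform(["car"])[0])
print(code_from_train, code_from_refit)
```

Here "car" is encoded as 1 by the encoder fitted on the training values and as 0 by the refitted one, so a model trained on the first encoding would misread the second.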

    There's a somewhat hacky way to reuse the LabelEncoders you got during training. A fitted LabelEncoder has only one attribute, namely classes_. You can save it and then restore it like this:

    Train:

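    import numpy
    from sklearn.preprocessing import LabelEncoder
    encoder = LabelEncoder()
    encoder.fit(X)
    # classes_ holds the sorted unique labels seen during fit
    numpy.save('classes.npy', encoder.classes_)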
    

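    Test:

    import numpy
    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    # allow_pickle=True is only needed when classes_ was saved as an object array
    encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
    # Now you should be able to use encoder
    # as you would do after `fit`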
    

    This seems more efficient than refitting it using the same data.

  • 2020-12-02 18:06

    I found no other post about nominal/categorical encoding, so I am expanding on the solutions above and sharing mine for the OrdinalEncoder approach (which may be what the author intended anyway).

    I did the following with OrdinalEncoder (it should work with LabelEncoder as well). Note that I am using categories_ instead of classes_.

    1. Create an Encoder dictionary
    2. Save it with numpy
    3. Load it with numpy
    4. Iterate over the dict and apply the transformation on each column

    Note: np stands for numpy.

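    # ------- Steps 1 and 2: in the file/cell where the encoding is exported
    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    # df and nominal_columns are assumed to be defined already
    encoder_dict = dict()

    for nom in nominal_columns:
        enc = OrdinalEncoder()  # a fresh encoder for each column
        enc = enc.fit(df[[nom]])
        df[[nom]] = enc.transform(df[[nom]])
        # Categories are stored as strings, so this suits string-valued columns
        encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]

    # np.save pickles the dict inside a 0-d object array
    np.save('FILE_NAME.npy', encoder_dict)


    # ------- Steps 3 and 4: in the file where the encoding is imported
    import numpy as np
    from sklearn.preprocessing import OrdinalEncoder

    def apply_saved_encoding(df):  # function name is only illustrative
        enc = OrdinalEncoder()
        # .item() unwraps the dict from the 0-d object array
        encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).item()

        for nom in encoder_dict:
            if nom in df.columns:
                # Setting categories_ by hand bypasses fit(); depending on the
                # scikit-learn version, transform() may need other fitted attributes.
                enc.categories_ = encoder_dict[nom]
                df[[nom]] = enc.transform(df[[nom]])
        return df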
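As a possible alternative to assigning categories_ by hand, the saved categories can be passed to the OrdinalEncoder constructor through its categories parameter. This is a different technique from the one above, sketched here with made-up frames and column names; JSON is used for storage only as an example and assumes string-valued categories.

```python
import json

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data; the column names are assumptions for illustration.
nominal_columns = ["color", "size"]
df_train = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})
df_test = pd.DataFrame({"color": ["blue", "red"], "size": ["M", "M"]})

# Training side: fit once, then serialize the learned categories.
train_encoder = OrdinalEncoder().fit(df_train[nominal_columns])
saved = json.dumps([cats.tolist() for cats in train_encoder.categories_])

# Test side: rebuild the encoder with the saved categories fixed up front.
saved_categories = json.loads(saved)
test_encoder = OrdinalEncoder(categories=saved_categories)
# fit() keeps the given categories and only validates the data against them.
encoded_test = test_encoder.fit_transform(df_test[nominal_columns])
```

With the default handle_unknown='error', the fit on the test side raises a ValueError if the test data contains a category that was not saved during training.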
    