sklearn.LabelEncoder with never seen before values

执笔经年 2020-11-27 10:37

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.
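
A minimal reproduction of the failure, assuming made-up labels (the animal names are illustrative only): fitting on one set of labels and then transforming a value outside that set raises a ValueError.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(["cat", "dog", "fish"])  # labels seen in the "training set"

try:
    # "bird" was never seen during fit, so transform cannot encode it.
    encoder.transform(["dog", "bird"])
    raised = False
except ValueError as err:
    raised = True
    print(f"transform failed: {err}")
```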

The only solution I c

12 answers
  • 2020-11-27 11:17

    I know two developers who are building wrappers around transformers and sklearn pipelines. Their skutil library includes two robust encoder transformers, one dummy encoder and one label encoder, that can handle unseen values. See the skutil documentation and search for skutil.preprocessing.OneHotCategoricalEncoder or skutil.preprocessing.SafeLabelEncoder. In their SafeLabelEncoder(), unseen values are automatically encoded as 999999.
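
The sentinel idea described above can be sketched without any library. This is not skutil's implementation; the helper names `fit_mapping` and `transform_with_sentinel` are hypothetical, and only the 999999 value comes from the description.

```python
import numpy as np

UNSEEN_CODE = 999999  # sentinel value taken from the description above


def fit_mapping(values):
    """Map each distinct training value to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def transform_with_sentinel(values, mapping, unseen_code=UNSEEN_CODE):
    """Encode values; anything missing from the mapping gets the sentinel."""
    return np.array([mapping.get(value, unseen_code) for value in values])


mapping = fit_mapping(["red", "green", "blue", "green"])
codes = transform_with_sentinel(["green", "purple", "blue"], mapping)
print(codes)
```

A sentinel far outside the normal code range makes unseen values easy to spot, but a model will still treat it as an ordinary number, so it may need separate handling downstream.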

  • 2020-11-27 11:21

    I faced the same problem and realized that my encoder was mixing values across the columns of my DataFrame. Say you run one encoder over several columns: it assigns numbers to labels automatically, and sometimes two different columns contain similar values. I solved the problem by creating a separate LabelEncoder() instance for each column of my pandas DataFrame, and the results were good.

    from sklearn.preprocessing import LabelEncoder

    # One encoder per column, so each column keeps its own label-to-integer mapping.
    encoder1 = LabelEncoder()
    encoder2 = LabelEncoder()
    encoder3 = LabelEncoder()
    
    df['col1'] = encoder1.fit_transform(df['col1'])
    df['col2'] = encoder2.fit_transform(df['col2'])
    df['col3'] = encoder3.fit_transform(df['col3'])
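
The same per-column idea scales to many columns if the encoders live in a dictionary keyed by column name. A minimal sketch, assuming a small made-up DataFrame; the `encoders` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data; "col1" and "col3" deliberately share the label "b".
df = pd.DataFrame({
    "col1": ["a", "b", "a"],
    "col2": ["x", "x", "y"],
    "col3": ["b", "c", "b"],
})

encoders = {}  # column name -> the encoder fitted on that column
for column in ["col1", "col2", "col3"]:
    encoders[column] = LabelEncoder()
    df[column] = encoders[column].fit_transform(df[column])

print(df)
print(encoders["col3"].classes_)  # the mapping is specific to col3
```

Keeping the fitted encoders around also lets you call `inverse_transform` per column later.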
    

    Regards!!

  • 2020-11-27 11:21

    If someone is still looking for it, here is my fix.

    Say you have:
    enc_list: a list of the names of the variables that are already encoded
    enc_map: a dictionary mapping each variable in enc_list to its fitted encoder
    df: a DataFrame containing values of a variable that are not present in enc_map

    This works as long as the encoded values already include an "NA" or "Unknown" category.

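    for l in enc_list:
        old_list = enc_map[l].classes_    # labels seen when the encoder was fitted
        new_list = df[l].unique()         # labels present in the new data
        na = [j for j in new_list if j not in old_list]    # labels the encoder has never seen
        df[l] = df[l].replace(na, 'NA')   # map unseen labels to the existing 'NA' category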
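
For a single column, the same replacement can be written with `isin` and `where`. A minimal sketch, assuming made-up data in which "NA" is already one of the fitted classes; the `train` and `test` series are illustrative names.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.Series(["low", "high", "NA", "low"])  # "NA" is part of the fitted classes
test = pd.Series(["high", "medium", "low"])      # "medium" was never seen

encoder = LabelEncoder().fit(train)

# Keep values the encoder knows and replace everything else with "NA".
known = test.isin(encoder.classes_)
cleaned = test.where(known, "NA")
codes = encoder.transform(cleaned)
print(codes)
```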
    
  • 2020-11-27 11:32

    I get the impression that what you've done is quite similar to what other people do when faced with this situation.

    There's been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially https://github.com/scikit-learn/scikit-learn/pull/3483 and https://github.com/scikit-learn/scikit-learn/pull/3599), but changing the existing behavior is actually more difficult than it seems at first glance.

    For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.
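
For feature columns (not targets), scikit-learn 0.24 and later offer one built-in option: OrdinalEncoder accepts handle_unknown="use_encoded_value" together with unknown_value. A minimal sketch, assuming such a version is installed and using made-up country values; note that this is OrdinalEncoder rather than LabelEncoder, and that it expects 2D input.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Unknown categories are encoded as -1 instead of raising an error.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

# One feature column, shaped (n_samples, 1).
train = np.array([["Canada"], ["France"], ["Italy"]], dtype=object)
encoder.fit(train)

test = np.array([["France"], ["India"]], dtype=object)  # "India" is unseen
codes = encoder.transform(test)
print(codes)
```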

  • 2020-11-27 11:34

    I have created a class to support this. When a new label arrives, it is assigned to the Unknown class.

    from sklearn.preprocessing import LabelEncoder
    import numpy as np
    
    
    class LabelEncoderExt(object):
        def __init__(self):
            """
            It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
            Unknown will be added in fit and transform will take care of new item. It gives unknown class id
            """
            self.label_encoder = LabelEncoder()
            # self.classes_ = self.label_encoder.classes_
    
        def fit(self, data_list):
            """
            This will fit the encoder for all the unique values and introduce unknown value
            :param data_list: A list of string
            :return: self
            """
            self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
            self.classes_ = self.label_encoder.classes_
    
            return self
    
        def transform(self, data_list):
            """
            This will transform the data_list to id list where the new values get assigned to Unknown class
            :param data_list:
            :return:
            """
            new_data_list = list(data_list)
            for unique_item in np.unique(data_list):
                if unique_item not in self.label_encoder.classes_:
                    new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]
    
            return self.label_encoder.transform(new_data_list)
    

    The sample usage:

    country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina', 'US']
    
    label_encoder = LabelEncoderExt()
    
    label_encoder.fit(country_list)
    print(label_encoder.classes_) # you can see new class called Unknown
    print(label_encoder.transform(country_list))
    
    
    new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
    print(label_encoder.transform(new_country_list))
    
  • 2020-11-27 11:34

    If it is just about training and testing a model, why not fit the label encoder on the entire dataset, and then use the classes generated by the encoder object to transform both splits?

    from sklearn.preprocessing import LabelEncoder

    encoder = LabelEncoder()
    encoder.fit(df["label"])  # fit on the labels of the entire dataset
    train_y = encoder.transform(train_y)
    test_y = encoder.transform(test_y)
    