sklearn.LabelEncoder with never seen before values

后端未结

关注

 12  945

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I c

相关标签:

12条回答

一向

2020-11-27 11:17

I know two devs that are working on building wrappers around transformers and Sklearn pipelines. They have 2 robust encoder transformers (one dummy and one label encoders) that can handle unseen values. Here is the documentation to their skutil library. Search for skutil.preprocessing.OneHotCategoricalEncoder or skutil.preprocessing.SafeLabelEncoder. In their SafeLabelEncoder(), unseen values are auto encoded to 999999.

0 讨论(0)
发布评论:

提交评论
- 加载中...
花落未央

2020-11-27 11:21
I face the same problem and realized that my encoder was somehow mixing values within my columns dataframe. Lets say that you run your encoder for several columns and when assigning numbers to labels the encoder automatically writes numbers to it and sometimes turns out that you have two different columns with similar values. What I did to solve the problem was to create an instance of LabelEncoder() for each column in my pandas DataFrame and I have a nice result.
```
encoder1 = LabelEncoder()
encoder2 = LabelEncoder()
encoder3 = LabelEncoder()

df['col1'] = encoder1.fit_transform(list(df['col1'].values))
df['col2'] = encoder2.fit_transform(list(df['col2'].values))
df['col3'] = encoder3.fit_transform(list(df['col3'].values))
```
Regards!!
0 讨论(0)
发布评论:

提交评论
- 加载中...
青春惊慌失措

2020-11-27 11:21
If someone is still looking for it, here is my fix.

Say you have
enc_list : list of variables names already encoded
enc_map : the dictionary containing variables from enc_list and corresponding encoded mapping
df : dataframe containing values of a variable not present in enc_map

This will work assuming you already have category "NA" or "Unknown" in the encoded values
```
for l in enc_list:  

    old_list = enc_map[l].classes_
    new_list = df[l].unique()
    na = [j for j in new_list if j not in old_list]
    df[l] = df[l].replace(na,'NA')
```
0 讨论(0)
发布评论:

提交评论
- 加载中...
孤独总比滥情好

2020-11-27 11:32

I get the impression that what you've done is quite similar to what other people do when faced with this situation.

There's been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially https://github.com/scikit-learn/scikit-learn/pull/3483 and https://github.com/scikit-learn/scikit-learn/pull/3599), but changing the existing behavior is actually more difficult than it seems at first glance.

For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.

0 讨论(0)
发布评论:

提交评论
- 加载中...

离开以前

2020-11-27 11:34

I have created a class to support this. If you have a new label comes, this will assign it as unknown class.

from sklearn.preprocessing import LabelEncoder
import numpy as np


class LabelEncoderExt(object):
    def __init__(self):
        """
        It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
        Unknown will be added in fit and transform will take care of new item. It gives unknown class id
        """
        self.label_encoder = LabelEncoder()
        # self.classes_ = self.label_encoder.classes_

    def fit(self, data_list):
        """
        This will fit the encoder for all the unique values and introduce unknown value
        :param data_list: A list of string
        :return: self
        """
        self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
        self.classes_ = self.label_encoder.classes_

        return self

    def transform(self, data_list):
        """
        This will transform the data_list to id list where the new values get assigned to Unknown class
        :param data_list:
        :return:
        """
        new_data_list = list(data_list)
        for unique_item in np.unique(data_list):
            if unique_item not in self.label_encoder.classes_:
                new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]

        return self.label_encoder.transform(new_data_list)

The sample usage:

country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina, ''US']

label_encoder = LabelEncoderExt()

label_encoder.fit(country_list)
print(label_encoder.classes_) # you can see new class called Unknown
print(label_encoder.transform(country_list))


new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))

0 讨论(0)

忘掉有多难

2020-11-27 11:34
If it is just about training and testing a model, why not just labelencode on entire dataset. And then use the generated classes from the encoder object.
```
encoder = LabelEncoder()
encoder.fit_transform(df["label"])
train_y = encoder.transform(train_y)
test_y = encoder.transform(test_y)
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

上一页 1 2