If a sklearn.LabelEncoder
has been fitted on a training set, it might break if it encounters new values when used on a test set.
The only solution I c
I know two devs that are working on building wrappers around transformers and Sklearn pipelines. They have 2 robust encoder transformers (one dummy and one label encoders) that can handle unseen values. Here is the documentation to their skutil library. Search for skutil.preprocessing.OneHotCategoricalEncoder
or skutil.preprocessing.SafeLabelEncoder
. In their SafeLabelEncoder()
, unseen values are auto encoded to 999999.
I face the same problem and realized that my encoder was somehow mixing values within my columns dataframe. Lets say that you run your encoder for several columns and when assigning numbers to labels the encoder automatically writes numbers to it and sometimes turns out that you have two different columns with similar values. What I did to solve the problem was to create an instance of LabelEncoder() for each column in my pandas DataFrame and I have a nice result.
encoder1 = LabelEncoder()
encoder2 = LabelEncoder()
encoder3 = LabelEncoder()
df['col1'] = encoder1.fit_transform(list(df['col1'].values))
df['col2'] = encoder2.fit_transform(list(df['col2'].values))
df['col3'] = encoder3.fit_transform(list(df['col3'].values))
Regards!!
If someone is still looking for it, here is my fix.
Say you have
enc_list
: list of variables names already encoded
enc_map
: the dictionary containing variables from enc_list
and corresponding encoded mapping
df
: dataframe containing values of a variable not present in enc_map
This will work assuming you already have category "NA" or "Unknown" in the encoded values
for l in enc_list:
old_list = enc_map[l].classes_
new_list = df[l].unique()
na = [j for j in new_list if j not in old_list]
df[l] = df[l].replace(na,'NA')
I get the impression that what you've done is quite similar to what other people do when faced with this situation.
There's been some effort to add the ability to encode unseen labels to the LabelEncoder (see especially https://github.com/scikit-learn/scikit-learn/pull/3483 and https://github.com/scikit-learn/scikit-learn/pull/3599), but changing the existing behavior is actually more difficult than it seems at first glance.
For now it looks like handling "out-of-vocabulary" labels is left to individual users of scikit-learn.
I have created a class to support this. If you have a new label comes, this will assign it as unknown class.
from sklearn.preprocessing import LabelEncoder
import numpy as np
class LabelEncoderExt(object):
def __init__(self):
"""
It differs from LabelEncoder by handling new classes and providing a value for it [Unknown]
Unknown will be added in fit and transform will take care of new item. It gives unknown class id
"""
self.label_encoder = LabelEncoder()
# self.classes_ = self.label_encoder.classes_
def fit(self, data_list):
"""
This will fit the encoder for all the unique values and introduce unknown value
:param data_list: A list of string
:return: self
"""
self.label_encoder = self.label_encoder.fit(list(data_list) + ['Unknown'])
self.classes_ = self.label_encoder.classes_
return self
def transform(self, data_list):
"""
This will transform the data_list to id list where the new values get assigned to Unknown class
:param data_list:
:return:
"""
new_data_list = list(data_list)
for unique_item in np.unique(data_list):
if unique_item not in self.label_encoder.classes_:
new_data_list = ['Unknown' if x==unique_item else x for x in new_data_list]
return self.label_encoder.transform(new_data_list)
The sample usage:
country_list = ['Argentina', 'Australia', 'Canada', 'France', 'Italy', 'Spain', 'US', 'Canada', 'Argentina, ''US']
label_encoder = LabelEncoderExt()
label_encoder.fit(country_list)
print(label_encoder.classes_) # you can see new class called Unknown
print(label_encoder.transform(country_list))
new_country_list = ['Canada', 'France', 'Italy', 'Spain', 'US', 'India', 'Pakistan', 'South Africa']
print(label_encoder.transform(new_country_list))
If it is just about training and testing a model, why not just labelencode on entire dataset. And then use the generated classes from the encoder object.
encoder = LabelEncoder()
encoder.fit_transform(df["label"])
train_y = encoder.transform(train_y)
test_y = encoder.transform(test_y)