Question
I'm working with text data where a lot of user error has to be accounted for. For example, when predicting on new data there are many cases where labels occur that the encoder hasn't seen before, due to typos and the like. I just want to ignore these: when I run labelencoder.transform(df_newdata['GL_Description']), it should skip anything it hasn't seen before. How can I do this? I didn't find a parameter for it in the docs. Is the only way really to check every word one by one "by hand" and drop the unknown ones, or can I tell the encoder to ignore any new labels that are not in its dictionary?
Answer 1:
For that you can subclass LabelEncoder and override transform and inverse_transform with tolerant versions. Something like this:
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import column_or_1d
from sklearn.utils.validation import check_is_fitted


class TolerantLabelEncoder(LabelEncoder):
    def __init__(self, ignore_unknown=False,
                 unknown_original_value='unknown',
                 unknown_encoded_value=-1):
        self.ignore_unknown = ignore_unknown
        self.unknown_original_value = unknown_original_value
        self.unknown_encoded_value = unknown_encoded_value

    def transform(self, y):
        check_is_fitted(self, 'classes_')
        y = column_or_1d(y, warn=True)
        # True where the label was seen during fit
        indices = np.isin(y, self.classes_)
        if not self.ignore_unknown and not np.all(indices):
            raise ValueError("y contains new labels: %s"
                             % str(np.setdiff1d(y, self.classes_)))
        # searchsorted returns an insertion point even for unseen labels,
        # so overwrite those positions with the placeholder code
        y_transformed = np.searchsorted(self.classes_, y)
        y_transformed[~indices] = self.unknown_encoded_value
        return y_transformed

    def inverse_transform(self, y):
        check_is_fitted(self, 'classes_')
        y = column_or_1d(y, warn=True)
        labels = np.arange(len(self.classes_))
        # True where the code maps to a known class
        indices = np.isin(y, labels)
        if not self.ignore_unknown and not np.all(indices):
            raise ValueError("y contains new labels: %s"
                             % str(np.setdiff1d(y, labels)))
        # Start from the placeholder and fill in only the valid codes,
        # so out-of-range codes never index into classes_
        y_transformed = np.full(len(y), self.unknown_original_value, dtype=object)
        y_transformed[indices] = self.classes_[y[indices]]
        return y_transformed
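The membership mask in transform matters because np.searchsorted only reports where a value would be inserted into the sorted classes, not whether it is actually there. A small NumPy-only sketch of that behaviour, using made-up labels (the arrays below are illustrative and not part of the encoder):

```python
import numpy as np

# Sorted array standing in for a fitted encoder's classes_ (illustrative data)
classes = np.array(["a", "b"])
# "c" and "A" are labels that were never seen
y = np.array(["a", "c", "b", "A"])

# Insertion points: unseen "A" lands on index 0, the same code as "a",
# and unseen "c" lands on index 2, which is past the end of classes
positions = np.searchsorted(classes, y)

# Membership test identifies which positions are real matches
known = np.isin(y, classes)

# Keep real matches, replace everything else with the placeholder -1
encoded = np.where(known, positions, -1)
print(positions)
print(encoded)
```

Without the mask, a typo could silently be encoded as a neighbouring valid class, which is why the unknown positions are overwritten rather than trusted.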
Example Usage:
en = TolerantLabelEncoder(ignore_unknown=True)
en.fit(['a','b'])
print(en.transform(['a', 'c', 'b']))
# Output: [ 0 -1  1]
print(en.inverse_transform([-1, 0, 1]))
# Output: ['unknown' 'a' 'b']
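If you would rather drop the unseen labels than map them to a placeholder, a boolean mask built from classes_ lets you keep the stock LabelEncoder. This is only a sketch under that assumption; the label values below are made up for illustration:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Fit on the labels known at training time (illustrative values)
encoder = LabelEncoder().fit(["cash", "rent", "travel"])

# New data containing two typos the encoder has never seen
new_labels = np.array(["rent", "rnet", "cash", "travle"])

# True only for labels present in the fitted classes_
seen = np.isin(new_labels, encoder.classes_)

# Transform only the known labels; unseen ones are filtered out first
encoded = encoder.transform(new_labels[seen])
print(seen)
print(encoded)
```

With a DataFrame, the same mask can be used to filter the rows (for example df_newdata[seen]) so that the remaining features stay aligned with the encoded labels.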
Source: https://stackoverflow.com/questions/50041551/tell-labelenocder-to-ignore-new-labels