sklearn.LabelEncoder with never seen before values

Backend · open · 12 answers · 944 views
执笔经年 2020-11-27 10:37

If a sklearn.LabelEncoder has been fitted on a training set, it might break if it encounters new values when used on a test set.

The only solution I c

12 answers
  • 2020-11-27 11:09

    I ended up switching to Pandas' get_dummies due to this problem of unseen data.

    • create the dummies on the training data
      dummy_train = pd.get_dummies(train)
    • create the dummies in the new (unseen data)
      dummy_new = pd.get_dummies(new_data)
    • re-index the new data to the columns of the training data, filling the missing values with 0
      dummy_new.reindex(columns = dummy_train.columns, fill_value=0)

    Effectively, any new categorical levels will not be passed to the classifier, which should not cause problems, since it would not know what to do with them anyway.
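    Putting the steps above together (the `train` and `new_data` frames here are hypothetical, with `'e'` appearing only in the new data):

    ```python
    import pandas as pd

    # hypothetical frames: 'e' appears only in the new data
    train = pd.DataFrame({"col": ["a", "b", "c"]})
    new_data = pd.DataFrame({"col": ["a", "e"]})

    # create the dummies on the training data
    dummy_train = pd.get_dummies(train, dtype=int)
    # create the dummies on the new (unseen) data
    dummy_new = pd.get_dummies(new_data, dtype=int)

    # re-index to the training columns; the unseen 'col_e' column is
    # dropped and the missing 'col_b'/'col_c' columns are filled with 0
    dummy_new = dummy_new.reindex(columns=dummy_train.columns, fill_value=0)
    # the row holding the unseen value 'e' becomes all zeros
    ```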

  • 2020-11-27 11:10

    LabelEncoder is basically a dictionary. You can extract and use it for future encoding:

    from sklearn.preprocessing import LabelEncoder
    
    le = LabelEncoder()
    le.fit(X)
    
    le_dict = dict(zip(le.classes_, le.transform(le.classes_)))
    

    Retrieve the label for a single new item; if the item was never seen, fall back to a default "unknown" value:

    le_dict.get(new_item, '<Unknown>')
    

    Retrieve labels for a Dataframe column:

    df[your_col] = df[your_col].apply(lambda x: le_dict.get(x, <unknown_value>))
    
  • 2020-11-27 11:13

    I was trying to deal with this problem and found two handy ways to encode categorical data from train and test sets, with and without using LabelEncoder. New categories are mapped to some known category "c" (like "other" or "missing"). The first method seems to be faster. Hope that helps.

    import pandas as pd
    import time
    df=pd.DataFrame()
    
    df["a"]=['a','b', 'c', 'd']
    df["b"]=['a','b', 'e', 'd']
    
    
    #LabelEncoder + map
    t = time.perf_counter()  # time.clock() was removed in Python 3.8
    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    suf = "_le"
    col = "a"
    df[col + suf] = le.fit_transform(df[col])
    dic = dict(zip(le.classes_, le.transform(le.classes_)))
    col = 'b'
    df[col + suf] = df[col].map(dic).fillna(dic["c"]).astype(int)
    print(time.perf_counter() - t)
    
    #---
    #pandas category
    
    t = time.perf_counter()
    df["d"] = df["a"].astype('category').cat.codes
    dic = df["a"].astype('category').cat.categories.tolist()
    # astype('category', categories=...) was removed; use CategoricalDtype
    from pandas.api.types import CategoricalDtype
    df['f'] = df['b'].astype(CategoricalDtype(categories=dic)).fillna("c").cat.codes
    df.dtypes
    print(time.perf_counter() - t)
    
  • 2020-11-27 11:15

    LabelEncoder() should be used only for encoding target labels. To encode categorical features, use OneHotEncoder(), which can handle unseen values via its handle_unknown parameter: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder
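    A minimal sketch of this: with handle_unknown='ignore', a category never seen during fit simply encodes to an all-zero row.

    ```python
    import numpy as np
    from sklearn.preprocessing import OneHotEncoder

    # fit on three known categories
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(np.array([['a'], ['b'], ['c']]))

    # 'd' was never seen during fit: it encodes to an all-zero row
    encoded = enc.transform(np.array([['a'], ['d']])).toarray()
    ```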

  • 2020-11-27 11:16

    Here is an approach using a relatively recent pandas feature. The main motivation is that machine-learning packages like LightGBM can accept pandas categoricals as feature columns, which is better than one-hot encoding in some situations. In this example the transformer returns integer codes, and unseen categorical values are replaced with -1.

    from collections import defaultdict
    from sklearn.base import BaseEstimator,TransformerMixin
    from pandas.api.types import CategoricalDtype
    import pandas as pd
    import numpy as np
    
    class PandasLabelEncoder(BaseEstimator,TransformerMixin):
        def __init__(self):
            self.label_dict = defaultdict(list)
    
        def fit(self, X):
            X = X.astype('category')
            cols = X.columns
            values = list(map(lambda col: X[col].cat.categories, cols))
            self.label_dict = dict(zip(cols,values))
            # return as category for xgboost or lightgbm 
            return self
    
        def transform(self, X):
            # check for columns missing from the fitted dictionary
            missing_col = set(X.columns) - set(self.label_dict.keys())
            if missing_col:
                raise ValueError('the column named {} is not in the label dictionary. Check your fitting data.'.format(missing_col))
            return X.apply(lambda x: x.astype('category')
                                      .cat.set_categories(self.label_dict[x.name])
                                      .cat.codes.astype('category')
                                      .cat.set_categories(np.arange(len(self.label_dict[x.name]))))
    
    
        def inverse_transform(self,X):
            return X.apply(lambda x: pd.Categorical.from_codes(codes=x.values,
                                                               categories=self.label_dict[x.name]))
    
    dff1 = pd.DataFrame({'One': list('ABCC'), 'Two': list('bccd')})
    dff2 = pd.DataFrame({'One': list('ABCDE'), 'Two': list('debca')})
    
    
    enc=PandasLabelEncoder()
    enc.fit_transform(dff1)
    
    One Two
    0   0   0
    1   1   1
    2   2   1
    3   2   2
    
    dff3=enc.transform(dff2)
    dff3
    
        One Two
    0   0   2
    1   1   -1
    2   2   0
    3   -1  1
    4   -1  -1
    
    enc.inverse_transform(dff3)
    
    One Two
    0   A   d
    1   B   NaN
    2   C   b
    3   NaN c
    4   NaN NaN
    
  • 2020-11-27 11:17

    I recently ran into this problem and came up with a fairly quick solution. My answer solves a little more than just this problem, but it will easily work for your issue too. (I think it's pretty cool.)

    I am working with pandas DataFrames and originally used sklearn's LabelEncoder() to encode my data, which I would then pickle for use in other modules of my program.

    However, the label encoder in sklearn's preprocessing does not have the ability to add new values to the encoding. I solved the problem of encoding multiple columns and saving the mappings, as well as being able to add new values to the encoder, like this (here's a rough outline of what I did):

    encoding_dict = dict()
    for col in cols_to_encode:
        #get unique values in the column to encode
        values = df[col].value_counts().index.tolist()
    
        # create a dictionary of values and corresponding number {value, number}
        dict_values = {value: count for value, count in zip(values, range(1,len(values)+1))}
    
        # save the values to encode in the dictionary
        encoding_dict[col] = dict_values
    
        # replace the values with the corresponding number from the dictionary
        # (values not in the dictionary map to NaN)
        df[col] = df[col].map(lambda x: dict_values.get(x))
    

    Then you can simply save the dictionary to a JSON file, and later load it back and add any new value along with its corresponding integer code.

    I'll explain the reasoning behind using map() instead of replace(): I found that pandas' replace() took over a minute to iterate through around 117,000 rows, while map() brought that down to just over 100 ms.

    TLDR: instead of using sklearn's preprocessing, just work with your DataFrame by building a mapping dictionary and mapping the values yourself.
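    A sketch of the save/extend step (the `encoding_dict` shape follows the loop above; the column name, values, and file path are illustrative):

    ```python
    import json

    # illustrative mapping in the shape built by the encoding loop
    encoding_dict = {"color": {"red": 1, "blue": 2}}

    # save the mapping to a JSON file
    with open("encoding_dict.json", "w") as f:
        json.dump(encoding_dict, f)

    # later: load it back and register a newly observed value
    with open("encoding_dict.json") as f:
        encoding_dict = json.load(f)

    col_map = encoding_dict["color"]
    new_value = "green"
    if new_value not in col_map:
        # assign the next unused integer code
        col_map[new_value] = max(col_map.values()) + 1
    ```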
