Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a…

22 answers
  • 2020-11-22 09:22

    Instead of LabelEncoder, we can use OrdinalEncoder from scikit-learn, which encodes several columns in one call.

    Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

    >>> from sklearn.preprocessing import OrdinalEncoder
    >>> enc = OrdinalEncoder()
    >>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
    >>> enc.fit(X)
    OrdinalEncoder()
    >>> enc.categories_
    [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
    >>> enc.transform([['Female', 3], ['Male', 1]])
    array([[0., 2.],
           [1., 0.]])
    

    Both the description and example were copied from its documentation page which you can find here:

    https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
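
    Applied to a pandas DataFrame like the one in the question, this could look as follows. This is only a minimal sketch: the frame and its column names are made up for illustration, and the codes are cast to int because OrdinalEncoder returns floats.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical example data; the column names are assumptions.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey"],
    "location": ["San_Diego", "New_York", "New_York", "San_Diego"],
})

enc = OrdinalEncoder()

# fit_transform returns a float array with one column of codes per input column.
codes = enc.fit_transform(df)
encoded = pd.DataFrame(codes, columns=df.columns, index=df.index).astype(int)

# inverse_transform maps the integer codes back to the original labels.
decoded = pd.DataFrame(
    enc.inverse_transform(encoded.to_numpy()),
    columns=df.columns,
    index=df.index,
)
```

    The fitted encoder keeps one category array per column in enc.categories_, so the same object can later transform new data or undo the encoding.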

  • 2020-11-22 09:25

    The problem is the shape of the data (a pandas DataFrame) you are passing to the fit function. LabelEncoder expects a 1-D array-like, such as a list or a single column, not a whole 2-D frame.
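
    A small sketch of the difference, using a made-up frame (the column names are assumptions): fitting on the whole frame fails, fitting on one column works, and DataFrame.apply can hand each column to a fresh encoder in turn.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical example data.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Champ", "Ron", "Brick"],
})

# A whole 2-D frame is rejected, since LabelEncoder is meant for one label vector.
try:
    LabelEncoder().fit(df)
    raised = False
except ValueError:
    raised = True

# A single column (a 1-D Series) is accepted.
pets_codes = LabelEncoder().fit_transform(df["pets"])

# apply passes each column as a 1-D Series; a new encoder is fitted per column.
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
```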

  • 2020-11-22 09:26

    It is possible to do all of this directly in pandas; the task is well suited to a little-known ability of the replace method.

    First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

    transform_dict = {}
    for col in df.columns:
        cats = pd.Categorical(df[col]).categories
        d = {}
        for i, cat in enumerate(cats):
            d[cat] = i
        transform_dict[col] = d
    
    transform_dict
    {'location': {'New_York': 0, 'San_Diego': 1},
     'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
     'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}
    

    Since this will always be a one-to-one mapping, we can invert each inner dictionary to get a mapping from the new values back to the originals.

    inverse_transform_dict = {}
    for col, d in transform_dict.items():
        inverse_transform_dict[col] = {v:k for k, v in d.items()}
    
    inverse_transform_dict
    {'location': {0: 'New_York', 1: 'San_Diego'},
     'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
     'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}
    

    Now we can use the replace method's ability to take a nested dictionary (a dictionary of dictionaries), treating the outer keys as column names and each inner dictionary as the old-to-new value mapping for that column.

    df.replace(transform_dict)
       location  owner  pets
    0         1      1     0
    1         0      2     1
    2         0      0     0
    3         1      1     2
    4         1      3     1
    5         0      2     1
    

    We can easily get back to the original values by chaining a second replace call with the inverted dictionary.

    df.replace(transform_dict).replace(inverse_transform_dict)
        location     owner    pets
    0  San_Diego     Champ     cat
    1   New_York       Ron     dog
    2   New_York     Brick     cat
    3  San_Diego     Champ  monkey
    4  San_Diego  Veronica     dog
    5   New_York       Ron     dog
    
  • 2020-11-22 09:27

    If all of your features are of type object, the first answer above works well: https://stackoverflow.com/a/31939145/5840973.

    But suppose the DataFrame has mixed-type columns. Then we can fetch the names of the object-typed features programmatically and label-encode only those.

    from sklearn import preprocessing

    # Fetch the names of the features of type object
    objFeatures = dataframe.select_dtypes(include="object").columns

    # Encode each object feature. The single encoder is refit for every column,
    # so afterwards le.classes_ only holds the labels of the last column.
    le = preprocessing.LabelEncoder()

    for feat in objFeatures:
        dataframe[feat] = le.fit_transform(dataframe[feat].astype(str))
    
    dataframe.info()
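
    If the codes need to be turned back into labels later, one option is to keep a separate fitted encoder per column. The following is only a sketch of that idea: the encoders dictionary and the example frame are assumptions, not part of the answer above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mixed-type frame; the string column is created with object dtype.
dataframe = pd.DataFrame({
    "pets": pd.Series(["cat", "dog", "monkey", "dog"], dtype=object),
    "age": [3, 5, 2, 7],
})
original = dataframe.copy()

# One fitted encoder per object column, keyed by column name.
encoders = {}
for feat in dataframe.select_dtypes(include="object").columns:
    encoders[feat] = LabelEncoder()
    dataframe[feat] = encoders[feat].fit_transform(dataframe[feat].astype(str))

# Decode a copy by asking each column's own encoder to invert its codes.
restored = dataframe.copy()
for feat, le in encoders.items():
    restored[feat] = le.inverse_transform(dataframe[feat])
```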
    
  • 2020-11-22 09:27

    How about this?

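    from sklearn.preprocessing import LabelEncoder

    def MultiColumnLabelEncode(choice, columns, X, encoders=None):
        # X is a 2-D NumPy object array; columns holds the column indices to process.
        if choice == 'encode':
            # Fit one encoder per column so that each mapping can be inverted later.
            encoders = [LabelEncoder() for _ in columns]
            for le, col in zip(encoders, columns):
                X[:, col] = le.fit_transform(X[:, col])
        elif choice == 'decode':
            if encoders is None:
                raise ValueError('decode needs the encoders returned by the encode call')
            for le, col in zip(encoders, columns):
                X[:, col] = le.inverse_transform(X[:, col].astype(int))
        else:
            raise ValueError('Please select correct parameter "choice". Available parameters: encode/decode')
        return encoders
    

    It is not the most efficient approach, but it works and it is very simple. The encoders returned by the 'encode' call have to be passed back in for 'decode', because each column needs the encoder that was fitted on it.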

  • 2020-11-22 09:29

    I mainly used @Alexander's answer, but had to make some changes:

    cols_need_mapped = ['col1', 'col2']
    
    mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)} 
         for col in df[cols_need_mapped]}
    
    for c in cols_need_mapped :
        df[c] = df[c].map(mapper[c])
    

    To reuse the mapping in the future, you can save the mapper dictionary to a JSON document; when you need it again, read it back in and apply it with the .map() function as above.
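
    A rough sketch of that round trip, assuming string labels (JSON object keys are always strings) and a hypothetical file name label_mapper.json; the example frames are made up:

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical training data and column list.
df = pd.DataFrame({"col1": ["a", "b", "a"], "col2": ["x", "x", "y"]})
cols_need_mapped = ["col1", "col2"]

mapper = {
    col: {cat: n for n, cat in enumerate(df[col].astype("category").cat.categories)}
    for col in cols_need_mapped
}

# Save the mapping; the path is only an example location.
path = os.path.join(tempfile.mkdtemp(), "label_mapper.json")
with open(path, "w") as f:
    json.dump(mapper, f)

# Later: load the mapping and apply it to new data.
with open(path) as f:
    loaded = json.load(f)

new_df = pd.DataFrame({"col1": ["b", "a", "c"], "col2": ["y", "x", "x"]})
for c in cols_need_mapped:
    # Labels missing from the saved mapping (here "c") become NaN.
    new_df[c] = new_df[c].map(loaded[c])
```

    Because .map() leaves unseen labels as NaN, it is worth checking the result for missing values before passing it to a model.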
