Label encoding across multiple columns in scikit-learn

后端未结

关注

 22  1971

I\'m trying to use scikit-learn\'s LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

相关标签:

22条回答

有刺的猬

2020-11-22 09:22
Instead of LabelEncoder we can use OrdinalEncoder from scikit learn, which allows multi-column encoding.

Encode categorical features as an integer array. The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
```
>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])
```
Both the description and example were copied from its documentation page which you can find here:

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html#sklearn.preprocessing.OrdinalEncoder
0 讨论(0)
发布评论:

提交评论
- 加载中...
一整个雨季

2020-11-22 09:25

The problem is the shape of the data (pd dataframe) you are passing to the fit function. You've got to pass 1d list.

0 讨论(0)
发布评论:

提交评论
- 加载中...

萌比男神i

2020-11-22 09:26

It is possible to do this all in pandas directly and is well-suited for a unique ability of the replace method.

First, let's make a dictionary of dictionaries mapping the columns and their values to their new replacement values.

transform_dict = {}
for col in df.columns:
    cats = pd.Categorical(df[col]).categories
    d = {}
    for i, cat in enumerate(cats):
        d[cat] = i
    transform_dict[col] = d

transform_dict
{'location': {'New_York': 0, 'San_Diego': 1},
 'owner': {'Brick': 0, 'Champ': 1, 'Ron': 2, 'Veronica': 3},
 'pets': {'cat': 0, 'dog': 1, 'monkey': 2}}

Since this will always be a one to one mapping, we can invert the inner dictionary to get a mapping of the new values back to the original.

inverse_transform_dict = {}
for col, d in transform_dict.items():
    inverse_transform_dict[col] = {v:k for k, v in d.items()}

inverse_transform_dict
{'location': {0: 'New_York', 1: 'San_Diego'},
 'owner': {0: 'Brick', 1: 'Champ', 2: 'Ron', 3: 'Veronica'},
 'pets': {0: 'cat', 1: 'dog', 2: 'monkey'}}

Now, we can use the unique ability of the replace method to take a nested list of dictionaries and use the outer keys as the columns, and the inner keys as the values we would like to replace.

df.replace(transform_dict)
   location  owner  pets
0         1      1     0
1         0      2     1
2         0      0     0
3         1      1     2
4         1      3     1
5         0      2     1

We can easily go back to the original by again chaining the replace method

df.replace(transform_dict).replace(inverse_transform_dict)
    location     owner    pets
0  San_Diego     Champ     cat
1   New_York       Ron     dog
2   New_York     Brick     cat
3  San_Diego     Champ  monkey
4  San_Diego  Veronica     dog
5   New_York       Ron     dog

0 讨论(0)

名媛妹妹

2020-11-22 09:27
If you have all the features of type object then the first answer written above works well https://stackoverflow.com/a/31939145/5840973.

But, Suppose when we have mixed type columns. Then we can fetch the list of features names of type object type programmatically and then Label Encode them.
```
#Fetch features of type Object
objFeatures = dataframe.select_dtypes(include="object").columns

#Iterate a loop for features of type object
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

for feat in objFeatures:
    dataframe[feat] = le.fit_transform(dataframe[feat].astype(str))
 

dataframe.info()
```
0 讨论(0)
发布评论:

提交评论
- 加载中...

感动是毒

2020-11-22 09:27

How about this?

def MultiColumnLabelEncode(choice, columns, X):
    LabelEncoders = []
    if choice == 'encode':
        for i in enumerate(columns):
            LabelEncoders.append(LabelEncoder())
        i=0    
        for cols in columns:
            X[:, cols] = LabelEncoders[i].fit_transform(X[:, cols])
            i += 1
    elif choice == 'decode': 
        for cols in columns:
            X[:, cols] = LabelEncoders[i].inverse_transform(X[:, cols])
            i += 1
    else:
        print('Please select correct parameter "choice". Available parameters: encode/decode')

It is not the most efficient, however it works and it is super simple.

0 讨论(0)

别那么骄傲

2020-11-22 09:29
Mainly used @Alexander answer but had to make some changes -
```
cols_need_mapped = ['col1', 'col2']

mapper = {col: {cat: n for n, cat in enumerate(df[col].astype('category').cat.categories)} 
     for col in df[cols_need_mapped]}

for c in cols_need_mapped :
    df[c] = df[c].map(mapper[c])
```
Then to re-use in the future you can just save the output to a json document and when you need it you read it in and use the .map() function like I did above.
0 讨论(0)
发布评论:

提交评论
- 加载中...

1 2 3 4 下一页