Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers

遇见更好的自我 (original poster)

2020-11-22 09:33
Assuming you simply want a sklearn.preprocessing.LabelEncoder() object that represents your column names, create the encoder and fit it on them:
```
le = sklearn.preprocessing.LabelEncoder()  # requires: import sklearn.preprocessing
le.fit(df.columns)
```
After fitting, each column name has a unique integer code. More precisely, there is a 1:1 mapping from df.columns to le.transform(df.columns.to_numpy()). To get a column's encoding, pass its name in a list to le.transform(...). For example, the following returns the encoding for every column:
```
# Index.get_values() was removed in pandas 1.0; to_numpy() replaces it.
le.transform(df.columns.to_numpy())
```
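For reference, here is a minimal self-contained sketch of the column-name case. The DataFrame and its column names (`fruit`, `color`, `size`) are made up for illustration and do not come from the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy DataFrame; column names and values are illustrative only.
df = pd.DataFrame({
    "fruit": ["apple", "pear", "apple"],
    "color": ["red", "green", "green"],
    "size": ["small", "large", "small"],
})

cols = df.columns.to_numpy()

le = LabelEncoder()
le.fit(cols)                     # learns one integer code per column name

codes = le.transform(cols)       # codes follow the sorted order of the names
column_to_code = dict(zip(cols.tolist(), codes.tolist()))
```

Because LabelEncoder sorts its classes, the codes reflect alphabetical order of the column names, not their position in the DataFrame.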
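If you instead want a sklearn.preprocessing.LabelEncoder() object covering all of the values stored in the rows (the row labels), you can do the following:
```
# DataFrame.get_values() was removed in pandas 1.0; flatten every cell into one 1-D array.
le.fit(df.to_numpy().ravel())
```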
In this case the row labels are most likely not unique (as shown in your question). To see which classes the encoder created, inspect le.classes_; it should contain the same elements as set(df.to_numpy().ravel()). As before, use le.transform(...) to convert a row label to its encoded label. For example, to retrieve the encoded label for the value in the first row of the first column in df.columns:
```
# DataFrame.get_value() was removed in pandas 1.0; iat does positional scalar access.
le.transform([df.iat[0, 0]])
```
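A small end-to-end sketch of this shared-encoder approach follows. The DataFrame contents are hypothetical, and re-encoding every column with the one fitted encoder is an assumption about how you might use it, not something stated in the question.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy DataFrame of string labels; values are illustrative only.
df = pd.DataFrame({
    "pet": ["cat", "dog", "cat"],
    "owner": ["ann", "bob", "ann"],
})

le = LabelEncoder()
le.fit(df.to_numpy().ravel())    # one shared vocabulary built from every cell

# Encode a single cell (first row, first column).
first_cell_code = le.transform([df.iat[0, 0]])[0]

# Encode every column with the same fitted encoder.
encoded = pd.DataFrame(
    {col: le.transform(df[col].to_numpy()) for col in df.columns},
    index=df.index,
)
```

With a single shared encoder, the same string gets the same code in every column, which may or may not be what you want when columns hold unrelated categories.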
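The question you had in your comment is a bit more complicated, but it can still be done:
```
# DataFrame.iteritems() was removed in pandas 2.0; items() yields (column name, Series) pairs.
le.fit([str(z) for z in set((col, val) for col, values in df.items() for val in values)])
```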
The above code does the following:
1. Builds the set of unique (column, value) pairs.
2. Represents each pair as the string version of the tuple. This is a workaround for the LabelEncoder class not supporting tuples as class names.
3. Fits the LabelEncoder on the new items.
Using this new model is a bit more involved. To get the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
```
# DataFrame.get_value() was removed in pandas 1.0; iat does positional scalar access.
le.transform([str((df.columns[0], df.iat[0, 0]))])
```
Remember that each lookup is now the string representation of a tuple containing (column, value).
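The tuple-as-string workaround can be sketched as below. The DataFrame and the `encode_cell` helper are hypothetical names introduced only for this illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy DataFrame; the value "x" appears in both columns on purpose.
df = pd.DataFrame({
    "a": ["x", "y", "x"],
    "b": ["x", "z", "z"],
})

# Unique (column, value) pairs, stringified because LabelEncoder is fitted
# on a 1-D array of scalar labels rather than on tuples.
pairs = sorted({(col, val) for col, values in df.items() for val in values})

le = LabelEncoder()
le.fit([str(p) for p in pairs])


def encode_cell(column, value):
    """Hypothetical helper: return the code of one (column, value) pair."""
    return int(le.transform([str((column, value))])[0])


code_a_x = encode_cell("a", "x")
code_b_x = encode_cell("b", "x")
```

Here the same value in different columns receives different codes, since the column name is part of the label. The lookup string has to be built exactly the way it was at fit time, so the values passed in should have the same Python type as those stored in the DataFrame.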