I'm trying to use scikit-learn's LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to a
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder()
object that can be used to represent your columns, all you have to do is:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder().fit(df.columns)
In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns). To get a column's encoding, simply pass it (wrapped in a list) to le.transform(...). As an example, the following will get the encoding for each column:
le.transform(df.columns)
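To make the column mapping concrete, here is a minimal sketch. The DataFrame and its values are made up for illustration and are not taken from the question; the only library calls used are LabelEncoder.fit, LabelEncoder.transform and the classes_ attribute.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame, invented for this example.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Champ", "Ron", "Brick"],
    "location": ["San_Diego", "New_York", "New_York"],
})

le = LabelEncoder()
le.fit(df.columns)

# classes_ holds the sorted unique column names; each code is a position in it.
print(list(le.classes_))  # ['location', 'owner', 'pets']

# Build a plain dict of column name -> integer code.
column_codes = {name: int(code)
                for name, code in zip(df.columns, le.transform(df.columns))}
print(column_codes)  # {'pets': 2, 'owner': 1, 'location': 0}
```

Note that the codes follow the sorted order of the names, not the order of the columns in the frame.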
Assuming you want to create a sklearn.preprocessing.LabelEncoder()
object for all of your row labels you can do the following:
le.fit([y for x in df.to_numpy() for y in x])
In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created, you can do le.classes_. You'll note that this should have the same elements as set(y for x in df.to_numpy() for y in x). Once again, to convert a row label to an encoded label, use le.transform(...). As an example, if you want to retrieve the label for the first column in the df.columns array and the first row, you could do this:
le.transform([df.iat[0, 0]])
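As a rough illustration of a single encoder shared across every cell, the sketch below fits on the flattened values and then looks up one cell. The frame is again hypothetical, and the final df.apply step is my own addition showing one way to encode the whole frame with the shared encoder; it assumes every cell holds a string.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame, invented for this example.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Champ", "Ron", "Brick"],
    "location": ["San_Diego", "New_York", "New_York"],
})

le = LabelEncoder()
# Flatten all cells into one 1-D array so every value shares one code space.
le.fit(df.to_numpy().ravel())

# Duplicates such as "cat" and "New_York" collapse into a single class.
print(list(le.classes_))

# Encode the value in the first row and first column.
first_cell_code = int(le.transform([df.iat[0, 0]])[0])

# Assumption: applying the fitted encoder column by column yields an
# integer frame with the same shape as df.
encoded = df.apply(le.transform)
print(encoded)
```

Because the encoder is shared, the same string gets the same code no matter which column it appears in.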
The question you had in your comment is a bit more complicated, but can still be accomplished:
le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
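The above code does the following:
1. Makes a unique combination of all the (column, row) pairs.
2. Represents each pair as the string version of its tuple. This is a workaround for the LabelEncoder class not supporting tuples as a class name.
3. Fits the new items to the LabelEncoder.
Now, using this new model is a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this: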
le.transform([str((df.columns[0], df.iat[0, 0]))])
Remember that each lookup is now a string representation of a tuple that contains the (column, row).
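The tuple-as-string approach can be sketched end to end as follows. The frame is hypothetical, and the names pairs, key and code are introduced only for this example; the round trip through inverse_transform is added to show that the string key can be recovered from its code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical frame, invented for this example.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Champ", "Ron", "Brick"],
    "location": ["San_Diego", "New_York", "New_York"],
})

# Unique (column, value) pairs, stringified as in the workaround above
# so that each class is a plain string rather than a tuple.
pairs = {str((col, value)) for col, series in df.items() for value in series}

le = LabelEncoder()
le.fit(sorted(pairs))

# Look up the pair for the first column and first row.
key = str((df.columns[0], df.iat[0, 0]))
code = int(le.transform([key])[0])

# inverse_transform maps the integer code back to the string key.
recovered = le.inverse_transform([code])[0]
print(key, code, recovered)
```

With this scheme the same value in two different columns gets two different codes, which is the opposite trade-off from fitting one encoder on the flattened values.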