I'm trying to use scikit-learn's LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to a
TL;DR: You can use the FlattenForEach wrapper class to transform your df like this:
FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df)
With this method, your label encoder can fit and transform within a regular scikit-learn Pipeline. First, the imports:
import numpy as np  # needed for the expected_output arrays below
from sklearn.preprocessing import LabelEncoder
from neuraxle.steps.column_transformer import ColumnTransformer
from neuraxle.steps.loop import FlattenForEach
Here is how a single shared LabelEncoder is applied to all the data to encode it:
p = FlattenForEach(LabelEncoder(), then_unflatten=True)
Result:
p, predicted_output = p.fit_transform(df.values)
expected_output = np.array([
    [6, 7, 6, 8, 7, 7],
    [1, 3, 0, 1, 5, 3],
    [4, 2, 2, 4, 4, 2]
]).transpose()
assert np.array_equal(predicted_output, expected_output)
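To make it clearer what the shared encoder is doing, here is a rough equivalent in plain NumPy and scikit-learn, without Neuraxle. Treat it as a sketch only: the sample data is my assumption of what df.values holds (the df itself is not shown here), and encode_shared is a hypothetical helper name, not part of any library.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumed sample data (not shown in the answer); columns: pets, owner, location.
values = np.array([
    ["cat", "Champ", "San_Diego"],
    ["dog", "Ron", "New_York"],
    ["cat", "Brick", "New_York"],
    ["monkey", "Champ", "San_Diego"],
    ["dog", "Veronica", "San_Diego"],
    ["dog", "Ron", "New_York"],
])


def encode_shared(values):
    """Hypothetical helper: fit one LabelEncoder on every cell, then restore the 2D shape."""
    encoder = LabelEncoder()
    flat = values.ravel()                   # flatten all columns into one 1D array
    encoded = encoder.fit_transform(flat)   # a single label space shared by all columns
    return encoder, encoded.reshape(values.shape)


encoder, shared_output = encode_shared(values)
print(shared_output)
```

With the assumed data, shared_output equals the expected_output array above. The pets get the highest codes because LabelEncoder sorts its classes, and the capitalized owner and location strings sort before the lowercase pet names.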
And here is how one standalone LabelEncoder is applied to the pets column, while a second one is shared between the owner and location columns. To be precise, this is a mix of separate and shared label encoders:
p = ColumnTransformer([
    # A different encoder will be used for column 0 with name "pets":
    (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
    # A shared encoder will be used for column 1 and 2, "owner" and "location":
    ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
], n_dimension=2)
Result:
p, predicted_output = p.fit_transform(df.values)
expected_output = np.array([
    [0, 1, 0, 2, 1, 1],
    [1, 3, 0, 1, 5, 3],
    [4, 2, 2, 4, 4, 2]
]).transpose()
assert np.array_equal(predicted_output, expected_output)
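For comparison, the same mix of separate and shared encoders can be sketched with plain NumPy and scikit-learn. Again this is only an illustration under assumptions: the sample data is my guess at the contents of df.values, and encode_by_groups is a hypothetical helper, not a Neuraxle or scikit-learn function.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Assumed sample data (not shown in the answer); columns: pets, owner, location.
values = np.array([
    ["cat", "Champ", "San_Diego"],
    ["dog", "Ron", "New_York"],
    ["cat", "Brick", "New_York"],
    ["monkey", "Champ", "San_Diego"],
    ["dog", "Veronica", "San_Diego"],
    ["dog", "Ron", "New_York"],
])


def encode_by_groups(values, column_groups):
    """Hypothetical helper: one LabelEncoder per group of columns."""
    output = np.empty(values.shape, dtype=int)
    encoders = []
    for columns in column_groups:
        encoder = LabelEncoder()
        block = values[:, columns]                    # 2D slice holding this group's columns
        encoded = encoder.fit_transform(block.ravel())
        output[:, columns] = encoded.reshape(block.shape)
        encoders.append(encoder)
    return encoders, output


# Column 0 ("pets") gets its own encoder; columns 1 and 2 share one.
encoders, grouped_output = encode_by_groups(values, [[0], [1, 2]])
print(grouped_output)
```

With the assumed data, grouped_output equals the expected_output array above: the pets column now starts from 0 on its own, while owner and location keep drawing from one shared set of labels.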