Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  •  既然无缘
    2020-11-22 09:40

    Using Neuraxle

    TL;DR: You can use the FlattenForEach wrapper class to transform your df like so: FlattenForEach(LabelEncoder(), then_unflatten=True).fit_transform(df).

    With this approach, the label encoder can fit and transform inside a regular scikit-learn Pipeline. First, the imports:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder
    from neuraxle.steps.column_transformer import ColumnTransformer
    from neuraxle.steps.loop import FlattenForEach
    

    One shared encoder for all columns:

    Here a single LabelEncoder is fitted on the values of every column at once, so the same label maps to the same integer wherever it appears:

        p = FlattenForEach(LabelEncoder(), then_unflatten=True)
    

    Result:

        p, predicted_output = p.fit_transform(df.values)
        expected_output = np.array([
            [6, 7, 6, 8, 7, 7],
            [1, 3, 0, 1, 5, 3],
            [4, 2, 2, 4, 4, 2]
        ]).transpose()
        assert np.array_equal(predicted_output, expected_output)
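
    For comparison, the same shared encoding can be sketched with plain NumPy and scikit-learn: flatten the table, fit one LabelEncoder on all values, then reshape back. This is only an illustration of the idea, not a description of how FlattenForEach is implemented. The DataFrame contents below are an assumption, reconstructed from the column names and the expected output, since the data itself is not shown in this answer.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed example data (not shown in the answer); chosen so that it
# matches the column names and the expected output above.
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

values = df.to_numpy(dtype=str)          # shape (6, 3)

# Fit one encoder on every cell of the table, then restore the 2-D shape.
encoder = LabelEncoder()
flat_codes = encoder.fit_transform(values.ravel())
shared_codes = flat_codes.reshape(values.shape)

print(shared_codes)
```

    Because the encoder sees all columns together, its classes_ attribute holds the sorted union of every label in the table, which is why the pets column starts at 6 rather than 0.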
    

    Different encoders per column:

    Here a first, standalone LabelEncoder is applied to the pets column, and a second one is shared by the owner and location columns. The result is a mix of separate and shared label encoders:

        p = ColumnTransformer([
            # A different encoder will be used for column 0 with name "pets":
            (0, FlattenForEach(LabelEncoder(), then_unflatten=True)),
            # A shared encoder will be used for column 1 and 2, "owner" and "location":
            ([1, 2], FlattenForEach(LabelEncoder(), then_unflatten=True)),
        ], n_dimension=2)
    

    Result:

        p, predicted_output = p.fit_transform(df.values)
        expected_output = np.array([
            [0, 1, 0, 2, 1, 1],
            [1, 3, 0, 1, 5, 3],
            [4, 2, 2, 4, 4, 2]
        ]).transpose()
        assert np.array_equal(predicted_output, expected_output)
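
    The mixed setup can likewise be sketched without Neuraxle, using one LabelEncoder for pets and a second one fitted on owner and location together. As before, this is a rough equivalent under assumptions rather than the library's own code, and the DataFrame contents are reconstructed from the column names and the expected output.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed example data (not shown in the answer).
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "monkey", "dog", "dog"],
    "owner": ["Champ", "Ron", "Brick", "Champ", "Veronica", "Ron"],
    "location": ["San_Diego", "New_York", "New_York",
                 "San_Diego", "San_Diego", "New_York"],
})

# Column 0 ("pets") gets its own encoder.
pets_encoder = LabelEncoder()
pets_codes = pets_encoder.fit_transform(df["pets"].to_numpy(dtype=str))

# Columns 1 and 2 ("owner", "location") share a second encoder:
# flatten both columns, encode, then restore the (6, 2) shape.
shared_values = df[["owner", "location"]].to_numpy(dtype=str)
shared_encoder = LabelEncoder()
shared_codes = (
    shared_encoder.fit_transform(shared_values.ravel())
    .reshape(shared_values.shape)
)

# Reassemble the columns in their original order.
mixed_codes = np.column_stack([pets_codes, shared_codes])

print(mixed_codes)
```

    Keeping the fitted encoders around (pets_encoder, shared_encoder) lets you call inverse_transform later to recover the original labels for each group of columns.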
    
