Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  •  花落未央
    2020-11-22 09:46

    As mentioned by larsmans, LabelEncoder() only takes a 1-d array as an argument. That said, it is quite easy to roll your own label encoder that operates on multiple columns of your choosing and returns a transformed dataframe. My code here is based in part on Zac Stewart's excellent blog post.

    A custom encoder is simply a class that implements the fit(), transform(), and fit_transform() methods. In your case, a good start might be something like this:

    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.pipeline import Pipeline
    
    # Create some toy data in a Pandas dataframe
    fruit_data = pd.DataFrame({
        'fruit':  ['apple','orange','pear','orange'],
        'color':  ['red','orange','green','green'],
        'weight': [5,6,3,4]
    })
    
    class MultiColumnLabelEncoder:
        def __init__(self,columns = None):
            self.columns = columns # array of column names to encode
    
        def fit(self,X,y=None):
            return self  # nothing is learned here; transform() fits a fresh LabelEncoder per column
    
        def transform(self,X):
            '''
            Transforms columns of X specified in self.columns using
            LabelEncoder(). If no columns specified, transforms all
            columns in X.
            '''
            output = X.copy()
            if self.columns is not None:
                for col in self.columns:
                    output[col] = LabelEncoder().fit_transform(output[col])
            else:
                for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                    output[colname] = LabelEncoder().fit_transform(col)
            return output
    
        def fit_transform(self,X,y=None):
            return self.fit(X,y).transform(X)
    

    Suppose we want to encode our two categorical attributes (fruit and color), while leaving the numeric attribute weight alone. We could do this as follows:

    MultiColumnLabelEncoder(columns = ['fruit','color']).fit_transform(fruit_data)
    

    This replaces the strings in the fruit and color columns of fruit_data with integer codes and leaves weight unchanged.
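
The per-column step that transform() performs can also be run directly on the toy data to see the resulting codes. This is only a sketch of the same idea without the wrapper class; the variable name encoded is an assumption made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Same toy data as in the answer
fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

encoded = fruit_data.copy()  # hypothetical name, not from the answer
for col in ['fruit', 'color']:
    # LabelEncoder sorts the distinct labels and numbers them from 0
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(encoded)
#    fruit  color  weight
# 0      0      2       5
# 1      1      1       6
# 2      2      0       3
# 3      1      0       4
```

The codes follow the sorted order of the labels within each column, so 'orange' is 1 in both columns only by coincidence.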

    Passing it a dataframe consisting entirely of categorical variables and omitting the columns parameter will result in every column being encoded (which I believe is what you were originally looking for):

    MultiColumnLabelEncoder().fit_transform(fruit_data.drop('weight',axis=1))
    

    Here both remaining columns, fruit and color, are replaced by integer codes.

    Note that numeric columns are not skipped: LabelEncoder accepts numbers and will re-code them too (the weights 5, 6, 3, 4 would become 2, 3, 0, 1), so either pass only the categorical columns or add some code to filter by dtype.
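
One way to keep numeric attributes out of the encoder is to pick the columns by dtype before encoding. The sketch below assumes that "categorical" simply means "not numeric"; the names categorical_cols and encoded are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

fruit_data = pd.DataFrame({
    'fruit':  ['apple', 'orange', 'pear', 'orange'],
    'color':  ['red', 'orange', 'green', 'green'],
    'weight': [5, 6, 3, 4],
})

# Assumption: every non-numeric column should be label-encoded
categorical_cols = fruit_data.select_dtypes(exclude='number').columns.tolist()

encoded = fruit_data.copy()
for col in categorical_cols:
    encoded[col] = LabelEncoder().fit_transform(encoded[col])

print(categorical_cols)  # ['fruit', 'color']
```

The resulting list could equally be passed as the columns argument of MultiColumnLabelEncoder.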

    Another nice feature is that this custom transformer can be used in a pipeline:

    encoding_pipeline = Pipeline([
        ('encoding',MultiColumnLabelEncoder(columns=['fruit','color']))
        # add more pipeline steps as needed
    ])
    encoding_pipeline.fit_transform(fruit_data)
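
Because the class above fits a fresh LabelEncoder inside transform(), the codes it assigns depend on whichever data it is given, so training and new data may be encoded differently. A possible variant, sketched here under that concern, learns one encoder per column in fit() and reuses it in transform(). The class name FittedMultiColumnLabelEncoder and the attribute encoders_ are assumptions for illustration, not part of the original answer.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class FittedMultiColumnLabelEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical variant: remembers one LabelEncoder per column."""

    def __init__(self, columns=None):
        self.columns = columns  # column names to encode; None means all

    def fit(self, X, y=None):
        cols = list(X.columns) if self.columns is None else list(self.columns)
        # Trailing underscore marks state learned during fit (sklearn convention)
        self.encoders_ = {col: LabelEncoder().fit(X[col]) for col in cols}
        return self

    def transform(self, X):
        output = X.copy()
        for col, encoder in self.encoders_.items():
            # Reuse the mapping learned in fit() instead of refitting
            output[col] = encoder.transform(output[col])
        return output


train = pd.DataFrame({'fruit': ['apple', 'orange', 'pear'], 'weight': [5, 6, 3]})
new = pd.DataFrame({'fruit': ['pear', 'apple'], 'weight': [4, 7]})

encoder = FittedMultiColumnLabelEncoder(columns=['fruit']).fit(train)
print(encoder.transform(new)['fruit'].tolist())  # [2, 0]
```

Inheriting from TransformerMixin supplies fit_transform(), and BaseEstimator supplies get_params()/set_params(), so the class can be cloned and used as a Pipeline step. A label that was not seen during fit() makes LabelEncoder.transform() raise a ValueError, which this sketch does not handle.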
    
