Label encoding across multiple columns in scikit-learn

礼貌的吻别 2020-11-22 09:02

I'm trying to use scikit-learn's LabelEncoder to encode a pandas DataFrame of string labels. As the dataframe has many (50+) columns, I want to a

22 answers
  • 2020-11-22 09:33

    No, LabelEncoder does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
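
To illustrate the 1-d restriction, here is a minimal sketch. The toy DataFrame and its column names are assumptions made up for illustration; they are not from the question. It shows that fitting on a whole 2-d frame is rejected, and that one common workaround is to fit a separate encoder on each column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (assumption, not from the original question).
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Ann", "Bob", "Ann"],
})

# Fitting on the whole 2-d frame fails: LabelEncoder expects 1-d input.
try:
    LabelEncoder().fit(df)
    raised = False
except ValueError:
    raised = True

# Workaround: fit a fresh encoder on each column (each column is 1-d).
encoded = df.apply(lambda col: LabelEncoder().fit_transform(col))
print(encoded)
```

Note that this throws the fitted encoders away, so it only suits cases where the codes never need to be mapped back to the original strings.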

  • 2020-11-22 09:33

    Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder() object that can be used to represent your columns, all you have to do is:

    le = sklearn.preprocessing.LabelEncoder()  # needs: import sklearn.preprocessing
    le.fit(df.columns)
    

    In the above code you will have a unique number corresponding to each column. More precisely, you will have a 1:1 mapping of df.columns to le.transform(df.columns.to_numpy()). To get a column's encoding, simply pass it to le.transform(...). As an example, the following will get the encoding for each column:

    le.transform(df.columns.to_numpy())
    

    Assuming you want to create a sklearn.preprocessing.LabelEncoder() object for all of the values in your rows, you can do the following:

    le.fit([y for x in df.to_numpy() for y in x])
    

    In this case, you most likely have non-unique values (as shown in your question). To see which classes the encoder created, inspect le.classes_. It should contain the same elements as set(y for x in df.to_numpy() for y in x). Once again, to convert a value to an encoded label, use le.transform(...). As an example, to retrieve the label for the value in the first column of df.columns and the first row, you could do this:

    le.transform([df.iloc[0, 0]])
    

    The question you had in your comment is a bit more complicated, but can still be accomplished:

    le.fit([str(z) for z in set((x[0], y) for x in df.items() for y in x[1])])
    

    The above code does the following:

    1. Builds the set of unique (column, value) pairs.
    2. Represents each pair as the string version of the tuple. This is a workaround for LabelEncoder not supporting tuples as class names.
    3. Fits the new items to the LabelEncoder.

    Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:

    le.transform([str((df.columns[0], df.iloc[0, 0]))])
    

    Remember that each lookup is now the string representation of a tuple containing (column, value).
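
The (column, value) approach can be sketched end to end as follows. The toy DataFrame is an assumption made up for illustration, and df.items() and df.iloc are used in place of the older pandas accessors.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (assumption, not from the original question).
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "owner": ["Ann", "Bob", "Ann"],
})

# Unique (column, value) pairs, each stored as the string form of the tuple.
pairs = sorted({str((col, val)) for col, series in df.items() for val in series})

le = LabelEncoder()
le.fit(pairs)

# Look up the code for the value in the first column and first row.
key = str((df.columns[0], df.iloc[0, 0]))
code = le.transform([key])[0]

# The same encoder maps the code back to the stringified tuple.
round_trip = le.inverse_transform([code])[0]
print(key, code, round_trip)
```

Because the keys are strings, the lookup key has to be built exactly the same way as the fitted keys, including quoting and spacing.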

  • 2020-11-22 09:36

    This does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies).

    However, for some classification tasks you could use:

    pandas.get_dummies(input_df) 
    

    This takes a dataframe with categorical data and returns a dataframe with binary indicator values. The variable values are encoded into the column names of the resulting dataframe.
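
A small sketch of what that looks like, assuming a made-up frame with one string column and one numeric column (the names are illustrative only):

```python
import pandas as pd

# Hypothetical toy data (assumption, not from the original question).
input_df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "age": [3, 5, 2],
})

# String columns are expanded into one indicator column per distinct value;
# numeric columns are passed through unchanged.
dummies = pd.get_dummies(input_df)
print(dummies.columns.tolist())
```

Each indicator column is named after the source column and the value it represents, separated by an underscore.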

  • 2020-11-22 09:38

    Label encoding a single column and inverting the transform is easy. Here is how to do it in Python when there are multiple columns:

    from sklearn import preprocessing

    def stringtocategory(dataset):
        '''
        Label-encode the object-dtype columns of a dataframe and return both
        the encoded data and its inverse transform.

        @author puja.sharma
        @param dataset dataframe whose columns should be label encoded
        @return label-encoded data and the inverse transform of that data
        '''
        # Work on real copies so the caller's dataframe is left untouched.
        data_original = dataset.copy()
        data_transformed = dataset.copy()
        for y in dataset.columns:
            # object dtype means the column contains strings or chars
            if dataset[y].dtype == object:
                print("The string type features are  : " + str(y))
                le = preprocessing.LabelEncoder()
                le.fit(dataset[y].unique())
                # label-encoded data
                data_transformed[y] = le.transform(dataset[y])
                # inverse label transform of the encoded data
                data_original[y] = le.inverse_transform(data_transformed[y])
        return data_transformed, data_original
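
A variation on the same idea, sketched under the assumption that the codes need to be decoded later (for example after a model has run): keep one fitted encoder per column in a dictionary. The toy data and the `encoders` name are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data (assumption, not from the original answer).
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat"],
    "age": [3, 5, 2],
})

encoders = {}            # column name -> fitted LabelEncoder
encoded = df.copy()
for col in df.columns:
    # Encode only the non-numeric columns.
    if not pd.api.types.is_numeric_dtype(df[col]):
        encoders[col] = LabelEncoder()
        encoded[col] = encoders[col].fit_transform(df[col])

# Later: restore the original strings column by column.
restored = encoded.copy()
for col, le in encoders.items():
    restored[col] = le.inverse_transform(encoded[col])
```

Holding on to the encoders is what makes the inverse transform possible outside the function that fitted them.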
    
  • 2020-11-22 09:39

    Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer and sklearn.preprocessing.OneHotEncoder:

    If you only have categorical variables, use OneHotEncoder directly:

    from sklearn.preprocessing import OneHotEncoder
    
    OneHotEncoder(handle_unknown='ignore').fit_transform(df)
    

    If you have heterogeneously typed features:

    from sklearn.compose import make_column_transformer
    from sklearn.preprocessing import RobustScaler
    from sklearn.preprocessing import OneHotEncoder
    
    categorical_columns = ['pets', 'owner', 'location']
    numerical_columns = ['age', 'weight', 'height']
    # Each tuple is (transformer, columns) in current scikit-learn releases.
    column_trans = make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_columns),
        (RobustScaler(), numerical_columns))
    column_trans.fit_transform(df)
    

    More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
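
If integer codes per column are wanted rather than one-hot columns, which is closer to what the question asks for, one option is sklearn.preprocessing.OrdinalEncoder. This is a substitute technique, not the OneHotEncoder approach above. A minimal sketch on made-up data, with illustrative column names:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder, RobustScaler

# Hypothetical toy data (assumption, not from the original answer).
df = pd.DataFrame({
    "pets": ["cat", "dog", "cat", "fish"],
    "owner": ["Ann", "Bob", "Ann", "Cy"],
    "age": [3.0, 5.0, 2.0, 4.0],
})

column_trans = make_column_transformer(
    (OrdinalEncoder(), ["pets", "owner"]),  # one integer code per category, per column
    (RobustScaler(), ["age"]),              # scale the numeric column
)
result = column_trans.fit_transform(df)

# The fitted categories are kept per column, in column order.
categories = column_trans.named_transformers_["ordinalencoder"].categories_
print(result)
```

Unlike LabelEncoder, OrdinalEncoder is designed for 2-d feature input, so it handles several columns in one fit.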

  • 2020-11-22 09:39

    I checked the source code of LabelEncoder (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py). It is built on a set of NumPy operations, one of which is np.unique(), and the encoder only takes 1-d array input (correct me if I am wrong).

    A very rough idea: first identify which columns need a LabelEncoder, then loop through each of those columns.

    import pandas as pd
    
    def cat_var(df):
        """Identify categorical features. 
    
        Parameters
        ----------
        df: original df after missing operations 
    
        Returns
        -------
        cat_var_df: summary df with col index and col name for all categorical vars
        """
        col_type = df.dtypes
        col_names = list(df)
    
        cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
        cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]
    
        cat_var_df = pd.DataFrame({'cat_ind': cat_var_index, 
                                   'cat_name': cat_var_name})
    
        return cat_var_df
    
    
    
    from sklearn.preprocessing import LabelEncoder 
    
    def column_encoder(df, cat_var_list):
        """Encoding categorical feature in the dataframe
    
        Parameters
        ----------
        df: input dataframe 
        cat_var_list: categorical feature index and name, from cat_var function
    
        Return
        ------
        df: new dataframe where categorical features are encoded
        label_list: classes_ attribute for all encoded features 
        """
    
        label_list = []
        # Use the summary passed in by the caller (the output of cat_var).
        cat_list = cat_var_list.loc[:, 'cat_name']
    
        for cat_feature in cat_list:
    
            le = LabelEncoder()
    
            le.fit(df[cat_feature])
            label_list.append(list(le.classes_))
    
            # Replace the whole column with its integer codes; this modifies df in place.
            df[cat_feature] = le.transform(df[cat_feature])
    
        return df, label_list
    

    The returned df is the encoded dataframe, and label_list shows what the encoded values mean in each corresponding column. This is a snippet from a data processing script I wrote for work. Let me know if you think it could be improved further.

    EDIT: I should mention that the methods above work best on a data frame with no missing values. I am not sure how they behave on a data frame that contains missing data (I ran a missing-value handling step before executing the methods above).
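
One possible way to deal with missing values before encoding, sketched under the assumption that it is acceptable to treat them as a category of their own. The placeholder string "missing" and the toy data are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy data with one missing entry (assumption).
df = pd.DataFrame({"pets": ["cat", np.nan, "dog", "cat"]})

# Replace NaN with an explicit placeholder so every entry is a string.
filled = df["pets"].fillna("missing")

le = LabelEncoder()
codes = le.fit_transform(filled)
print(list(le.classes_), codes)
```

Whether a placeholder category is appropriate depends on the downstream model; imputing or dropping rows are alternatives.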
