I\'m trying to use scikit-learn\'s LabelEncoder
to encode a pandas DataFrame
of string labels. As the dataframe has many (50+) columns, I want to a
No, LabelEncoder
does not do this. It takes 1-d arrays of class labels and produces 1-d arrays. It's designed to handle class labels in classification problems, not arbitrary data, and any attempt to force it into other uses will require code to transform the actual problem to the problem it solves (and the solution back to the original space).
Assuming you are simply trying to get a sklearn.preprocessing.LabelEncoder()
object that can be used to represent your columns, all you have to do is:
le.fit(df.columns)
In the above code you will have a unique number corresponding to each column.
More precisely, you will have a 1:1 mapping of df.columns
to le.transform(df.columns.get_values())
. To get a column's encoding, simply pass it to le.transform(...)
. As an example, the following will get the encoding for each column:
le.transform(df.columns.get_values())
Assuming you want to create a sklearn.preprocessing.LabelEncoder()
object for all of your row labels you can do the following:
le.fit([y for x in df.get_values() for y in x])
In this case, you most likely have non-unique row labels (as shown in your question). To see what classes the encoder created you can do le.classes_
. You'll note that this should have the same elements as in set(y for x in df.get_values() for y in x)
. Once again to convert a row label to an encoded label use le.transform(...)
. As an example, if you want to retrieve the label for the first column in the df.columns
array and the first row, you could do this:
le.transform([df.get_value(0, df.columns[0])])
The question you had in your comment is a bit more complicated, but can still be accomplished:
le.fit([str(z) for z in set((x[0], y) for x in df.iteritems() for y in x[1])])
The above code does the following:
LabelEncoder
class not supporting tuples as a class name.LabelEncoder
.Now to use this new model it's a bit more complicated. Assuming we want to extract the representation for the same item we looked up in the previous example (the first column in df.columns and the first row), we can do this:
le.transform([str((df.columns[0], df.get_value(0, df.columns[0])))])
Remember that each lookup is now a string representation of a tuple that contains the (column, row).
this does not directly answer your question (for which Naputipulu Jon and PriceHardman have fantastic replies)
However, for the purpose of a few classification tasks etc. you could use
pandas.get_dummies(input_df)
this can input dataframe with categorical data and return a dataframe with binary values. variable values are encoded into column names in the resulting dataframe. more
if we have single column to do the label encoding and its inverse transform its easy how to do it when there are multiple columns in python
def stringtocategory(dataset):
'''
@author puja.sharma
@see The function label encodes the object type columns and gives label encoded and inverse tranform of the label encoded data
@param dataset dataframe on whoes column the label encoding has to be done
@return label encoded and inverse tranform of the label encoded data.
'''
data_original = dataset[:]
data_tranformed = dataset[:]
for y in dataset.columns:
#check the dtype of the column object type contains strings or chars
if (dataset[y].dtype == object):
print("The string type features are : " + y)
le = preprocessing.LabelEncoder()
le.fit(dataset[y].unique())
#label encoded data
data_tranformed[y] = le.transform(dataset[y])
#inverse label transform data
data_original[y] = le.inverse_transform(data_tranformed[y])
return data_tranformed,data_original
Since scikit-learn 0.20 you can use sklearn.compose.ColumnTransformer
and sklearn.preprocessing.OneHotEncoder
:
If you only have categorical variables, OneHotEncoder
directly:
from sklearn.preprocessing import OneHotEncoder
OneHotEncoder(handle_unknown='ignore').fit_transform(df)
If you have heterogeneously typed features:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import OneHotEncoder
categorical_columns = ['pets', 'owner', 'location']
numerical_columns = ['age', 'weigth', 'height']
column_trans = make_column_transformer(
(categorical_columns, OneHotEncoder(handle_unknown='ignore'),
(numerical_columns, RobustScaler())
column_trans.fit_transform(df)
More options in the documentation: http://scikit-learn.org/stable/modules/compose.html#columntransformer-for-heterogeneous-data
I checked the source code (https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/preprocessing/label.py) of LabelEncoder. It was based on a set of numpy transformation, which one of those is np.unique(). And this function only takes 1-d array input. (correct me if I am wrong).
Very Rough ideas... first, identify which columns needed LabelEncoder, then loop through each column.
def cat_var(df):
"""Identify categorical features.
Parameters
----------
df: original df after missing operations
Returns
-------
cat_var_df: summary df with col index and col name for all categorical vars
"""
col_type = df.dtypes
col_names = list(df)
cat_var_index = [i for i, x in enumerate(col_type) if x=='object']
cat_var_name = [x for i, x in enumerate(col_names) if i in cat_var_index]
cat_var_df = pd.DataFrame({'cat_ind': cat_var_index,
'cat_name': cat_var_name})
return cat_var_df
from sklearn.preprocessing import LabelEncoder
def column_encoder(df, cat_var_list):
"""Encoding categorical feature in the dataframe
Parameters
----------
df: input dataframe
cat_var_list: categorical feature index and name, from cat_var function
Return
------
df: new dataframe where categorical features are encoded
label_list: classes_ attribute for all encoded features
"""
label_list = []
cat_var_df = cat_var(df)
cat_list = cat_var_df.loc[:, 'cat_name']
for index, cat_feature in enumerate(cat_list):
le = LabelEncoder()
le.fit(df.loc[:, cat_feature])
label_list.append(list(le.classes_))
df.loc[:, cat_feature] = le.transform(df.loc[:, cat_feature])
return df, label_list
The returned df would be the one after encoding, and label_list will show you what all those values means in the corresponding column. This is a snippet from a data process script I wrote for work. Let me know if you think there could be any further improvement.
EDIT: Just want to mention here that the methods above work with data frame with no missing the best. Not sure how it is working toward data frame contains missing data. (I had a deal with missing procedure before execute above methods)