The basic task that I have at hand is:
a) Read some tab-separated data.
b) Do some basic preprocessing.
c) For each categorical column, use LabelEncoder to encode the values.
For me the easiest way was exporting the LabelEncoder for each column as a .pkl
file. You have to export the encoder for each column after calling its fit_transform()
function.
For example
from sklearn.preprocessing import LabelEncoder
import pickle
import pandas as pd
df_train = pd.read_csv('training_data.csv')
le = LabelEncoder()
df_train['Departure'] = le.fit_transform(df_train['Departure'])
# exporting the fitted departure encoder
with open('Departure_encoder.pkl', 'wb') as output:
    pickle.dump(le, output)
Then, in the testing project, you can load the LabelEncoder object and apply its transform()
function directly:
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df_test = pd.read_csv('testing_data.csv')
# load the encoder file
import pickle
with open('Departure_encoder.pkl', 'rb') as pkl_file:
    le_departure = pickle.load(pkl_file)
df_test['Departure'] = le_departure.transform(df_test['Departure'])
What works for me is calling LabelEncoder().fit(X_train[col])
for each categorical column col, pickling these fitted objects,
and then reusing the same objects for transforming the same categorical columns
in the validation dataset. Basically you have a label encoder object for each of your categorical columns: call fit()
on the training data and pickle the object/model corresponding to each column of the training dataframe X_train.
Then, for each categorical column col of the validation set X_cv,
load the corresponding object/model and apply the transformation by accessing the transform function as transform(X_cv[col]).
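A minimal sketch of that per-column workflow (column names and data here are made up for illustration):

```python
import pickle

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# hypothetical training data with two categorical columns
X_train = pd.DataFrame({"Departure": ["BOS", "JFK", "BOS"],
                        "Arrival": ["LAX", "SFO", "SFO"]})

# fit one encoder per categorical column and keep them in a dict
encoders = {}
for col in ["Departure", "Arrival"]:
    le = LabelEncoder().fit(X_train[col])
    X_train[col] = le.transform(X_train[col])
    encoders[col] = le

# one pickle holding all per-column encoders
with open("encoders.pkl", "wb") as f:
    pickle.dump(encoders, f)

# later, when scoring the validation set X_cv
with open("encoders.pkl", "rb") as f:
    encoders = pickle.load(f)

X_cv = pd.DataFrame({"Departure": ["JFK"], "Arrival": ["LAX"]})
for col in ["Departure", "Arrival"]:
    X_cv[col] = encoders[col].transform(X_cv[col])
```

Storing all encoders in one dict keeps a single file per dataset instead of one pickle per column.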
You can also extract the mapping after you have encoded the values with the "le" object:
encoding = {}
for i in list(le.classes_):
    encoding[i] = le.transform([i])[0]
You will get an "encoding" dictionary that maps each label to its code for later use; with pandas you can export this dictionary to a csv, for example.
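For instance, the dictionary can be written out with pandas like this (the labels and the output filename here are just examples):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder().fit(["BOS", "JFK", "LAX"])  # example labels

# label -> integer code, same mapping as the loop above
encoding = {i: le.transform([i])[0] for i in le.classes_}

# one CSV row per (label, code) pair
pd.Series(encoding, name="code").rename_axis("label").to_csv("Departure_encoding.csv")
```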
According to the LabelEncoder implementation, the pipeline you've described will work correctly if and only if you fit
the LabelEncoders at test time with data that have exactly the same set of unique values.
There's a somewhat hacky way to reuse a LabelEncoder you got during training. LabelEncoder
has only one fitted attribute, namely classes_.
You can save it, and then restore it like this:
Train:
import numpy
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(X)
numpy.save('classes.npy', encoder.classes_)
Test:
encoder = LabelEncoder()
# allow_pickle=True is needed when classes_ is an object array (e.g. strings)
encoder.classes_ = numpy.load('classes.npy', allow_pickle=True)
# Now you should be able to use encoder
# as you would do after `fit`
This seems more efficient than refitting it using the same data.
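A quick round trip to check that the restored encoder behaves like the fitted one (the labels here are made up):

```python
import numpy
from sklearn.preprocessing import LabelEncoder

# fit on training labels and save only classes_
encoder = LabelEncoder()
encoder.fit(["paris", "tokyo", "amsterdam"])
numpy.save("classes.npy", encoder.classes_)

# restore into a fresh, unfitted encoder
restored = LabelEncoder()
restored.classes_ = numpy.load("classes.npy", allow_pickle=True)

# transform works without ever calling fit on `restored`
codes = restored.transform(["tokyo", "paris"])
```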
Since I found no other post about nominal/categorical encoding, I'll expand on the solutions above and share mine for the OrdinalEncoder approach (which may have been what the author intended anyway).
I did the following with OrdinalEncoder (but it should work with LabelEncoder as well). Note that I am using categories_
instead of classes_,
and that np
stands for numpy.
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# ------- step 1 and 2 in the file/cell where the encoding shall be exported
enc = OrdinalEncoder()
encoder_dict = dict()
for nom in nominal_columns:
    enc = enc.fit(df[[nom]])
    df[[nom]] = enc.transform(df[[nom]])
    encoder_dict[nom] = [[str(cat) for cat in sublist] for sublist in enc.categories_]
np.save('FILE_NAME.npy', encoder_dict)

# ------------ step 3 and 4 in the file where encoding shall be imported
enc = OrdinalEncoder()
encoder_dict = np.load('FILE_NAME.npy', allow_pickle=True).tolist()
for nom in encoder_dict:
    if nom in df.columns:
        enc.categories_ = encoder_dict[nom]
        df[[nom]] = enc.transform(df[[nom]])
# df now holds the encoded columns
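If hand-restoring categories_ feels fragile, note that a fitted OrdinalEncoder can also be pickled whole, just like the LabelEncoder earlier in this thread. A minimal sketch (made-up column and data; unpickle with the same scikit-learn version you trained with):

```python
import pickle

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# fit on the training frame (OrdinalEncoder expects 2-D input)
df = pd.DataFrame({"Departure": ["BOS", "JFK", "BOS"]})
enc = OrdinalEncoder().fit(df[["Departure"]])

# persist the entire fitted estimator
with open("ordinal_encoder.pkl", "wb") as f:
    pickle.dump(enc, f)

# later: load and transform new data directly
with open("ordinal_encoder.pkl", "rb") as f:
    enc_loaded = pickle.load(f)

codes = enc_loaded.transform(pd.DataFrame({"Departure": ["JFK", "BOS"]}))
```

This keeps all of the encoder's fitted state in one file, so nothing has to be reassigned by hand at load time.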