I'm currently exploring scikit-learn pipelines. I also want to preprocess the data with a pipeline. However, my train and test data have different levels of the categor
You can use categoricals as explained in this answer:
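# Build one shared category set so both splits get identical dummy columns.
# CategoricalDtype replaces the older astype('category', categories=...) form, which current pandas rejects.
categories = np.union1d(train, test)
train = train.astype(pd.CategoricalDtype(categories=categories))
test = test.astype(pd.CategoricalDtype(categories=categories))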
pd.get_dummies(train)
Out:
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  1  0  0
3  1  0  0  0
4  1  0  0  0
pd.get_dummies(test)
Out:
   a  b  c  d
0  1  0  0  0
1  0  1  0  0
2  0  0  1  0
3  0  0  0  1
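
For reference, here is a minimal self-contained sketch of the approach above. The `train` and `test` Series are hypothetical, chosen only so that they reproduce the outputs shown; the names `cat_type`, `train_dummies` and `test_dummies` are likewise my own. I pass `dtype=int` to `get_dummies` on the assumption that you want 0/1 columns, since recent pandas versions return booleans by default.

```python
import numpy as np
import pandas as pd

# Hypothetical data: train never sees the levels 'c' and 'd'.
train = pd.Series(list("abbaa"))
test = pd.Series(list("abcd"))

# One shared, sorted category set built from both splits.
categories = np.union1d(train, test)
cat_type = pd.CategoricalDtype(categories=categories)

# Casting to the shared dtype makes get_dummies emit a column for
# every category, including levels absent from a given split.
train_dummies = pd.get_dummies(train.astype(cat_type), dtype=int)
test_dummies = pd.get_dummies(test.astype(cat_type), dtype=int)

print(train_dummies)
print(test_dummies)
```

Both frames end up with the same columns in the same order, so an estimator fit on `train_dummies` should accept `test_dummies` without reindexing.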