Suppose I have a Pandas DataFrame like the below and I'm encoding categorical_1 for training in scikit-learn:
data = {'numeric_1':[12.1, 3.2, 5.5, 6.8, 9.9],
I ran into the same problem: how to keep the dummy columns consistent between training and testing data when using get_dummies() in Pandas. I found a solution while exploring the House Prices competition on Kaggle, which is to process the training and testing data together. Suppose you have two dataframes, df_train and df_test (with the target column removed from both):
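import pandas as pd

# Stack train and test so get_dummies() sees every category present in either one
all_data = pd.concat([df_train, df_test], axis=0)
all_data = pd.get_dummies(all_data)
n_train = df_train.shape[0]
X_train = all_data.iloc[:n_train]  # select the processed training data
X_test = all_data.iloc[n_train:]   # select the processed testing data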
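To see why this works, here is a minimal self-contained sketch. The toy frames below are made up for illustration (they are not the asker's data); the test set contains a category, "C", that never appears in training.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
df_train = pd.DataFrame({
    "numeric_1": [12.1, 3.2, 5.5],
    "categorical_1": ["A", "B", "A"],
})
df_test = pd.DataFrame({
    "numeric_1": [6.8, 9.9],
    "categorical_1": ["C", "B"],  # "C" never appears in training
})

# Encode both frames together so they share one set of dummy columns.
all_data = pd.get_dummies(pd.concat([df_train, df_test], axis=0))

# Split back by position: the first n_train rows are the training rows.
n_train = df_train.shape[0]
X_train = all_data.iloc[:n_train]
X_test = all_data.iloc[n_train:]

# Both frames now have identical columns, including one for "C".
print(list(X_train.columns))
print(list(X_test.columns))
```

Encoding df_train and df_test separately would instead give the two frames different columns (no "C" column for training, no "A" column for testing), which is the mismatch scikit-learn then complains about.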
Hope it helps.