Binary one-hot (also known as one-of-K) coding lies in making one binary column for each distinct value for a categorical variable. For example, if one has a color column (categ
If your columns are in the same order, you can concatenate the dfs, use get_dummies
, and then split them back again, e.g.,
encoded = pd.get_dummies(pd.concat([train,test], axis=0))
train_rows = train.shape[0]
train_encoded = encoded.iloc[:train_rows, :]
test_encoded = encoded.iloc[train_rows:, :]
If your columns are not in the same order, then you'll have challenges regardless of what method you try.
You can set your data type to categorical:
In [5]: df_train = pd.DataFrame({"car":Series(["seat","bmw"]).astype('category',categories=['seat','bmw','mercedes']),"color":["red","green"]})
In [6]: df_train
Out[6]:
car color
0 seat red
1 bmw green
In [7]: pd.get_dummies(df_train )
Out[7]:
car_seat car_bmw car_mercedes color_green color_red
0 1 0 0 0 1
1 0 1 0 1 0
See this issue of Pandas.