I am trying to apply both imputation and one-hot encoding to my data set. I know that applying imputation might change the dimensions of the data, so I took care of it manuall
The problem is in the first two lines. pd.get_dummies() will return different columns for train and test if the data in them differs.
For example, if a column has 3 categories in train, 3 columns will be created for it. But the test data may contain only 2 of those categories in that column, in which case you get only 2 columns after pd.get_dummies(). This leads to a different number of columns in train and test.
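A minimal sketch of the mismatch, using a made-up `color` column (the data and variable names here are illustrative, not taken from the question):

```python
import pandas as pd

# Hypothetical toy data: "color" has three categories in train, two in test.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "green"]})

# Each call only sees the values present in the frame it is given.
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

print(list(train_encoded.columns))  # three dummy columns
print(list(test_encoded.columns))   # only two: color_blue is missing
```

A model fitted on `train_encoded` would then receive a test matrix with one column fewer.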
There are a couple of things you can do here:
1) Easiest: use pd.get_dummies() on the whole data before the train/test split, and then split the data. But it's not recommended, because it leaks information from the test data into the model.
2) If you can use the development version of scikit-learn, use CategoricalEncoder to perform the one-hot encoding. (CategoricalEncoder was never released under that name; from scikit-learn 0.20 on, OneHotEncoder handles string categories directly.)
3) Use a combination of LabelEncoder + OneHotEncoder in the current scikit-learn version to achieve the same. See my other answer for an example.
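As a rough sketch of the encoder-based options, assuming scikit-learn 0.20 or later (where OneHotEncoder accepts string columns directly) and a made-up `color` column: the encoder is fitted on the training data only and then reused on the test data, so both end up with the same columns.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: the test set contains a category ("purple")
# that never appears in the training set.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "green", "purple"]})

# handle_unknown="ignore" encodes unseen categories as an all-zero row
# instead of raising an error at transform time.
encoder = OneHotEncoder(handle_unknown="ignore")

# Learn the categories from train only, then reuse them for test.
train_encoded = encoder.fit_transform(train).toarray()
test_encoded = encoder.transform(test).toarray()

print(encoder.categories_)
print(train_encoded.shape, test_encoded.shape)
```

Both matrices have one column per category seen in train, regardless of what the test data contains.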
Note: Also, only call transform() on the test data, never fit() or fit_transform(). Do this:
# If you call fit_transform() here, the imputer will learn a new
# mean from the test data, which leads to inconsistent imputation
# between train and test and to data leakage.
imp_test_X = pd.DataFrame(imputer.transform(test_X))
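For a fuller picture, here is a small end-to-end sketch of the same idea. It assumes scikit-learn's SimpleImputer with a mean strategy (the question's actual imputer may differ) and uses invented `age` and `height` columns:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing values in both sets.
train_X = pd.DataFrame({"age": [20.0, 30.0, np.nan],
                        "height": [150.0, np.nan, 170.0]})
test_X = pd.DataFrame({"age": [np.nan, 50.0],
                       "height": [180.0, np.nan]})

imputer = SimpleImputer(strategy="mean")

# fit_transform() on train: the column means are learned here.
imp_train_X = pd.DataFrame(imputer.fit_transform(train_X),
                           columns=train_X.columns)

# transform() on test: the means learned from train are reused.
imp_test_X = pd.DataFrame(imputer.transform(test_X),
                          columns=test_X.columns)

print(imputer.statistics_)  # per-column means learned from train_X
print(imp_test_X)
```

Passing `columns=` keeps the original column names, which a bare `pd.DataFrame(...)` around the returned array would drop.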
I've been struggling with a similar problem and I've found an approach that might help in this situation.
The main idea is to convert the column to a categorical dtype while you are still working with the complete dataset, with something like this:
dataframe[column] = dataframe[column].astype('category')
When you do that, the dataframe's column will store all the available categories. Later, when you perform a train/test split of the data, the categories are preserved even if some values are not present in one of the datasets.
Pandas' get_dummies function uses the categories of the column to perform the encoding. Since the categories are stable, you will always get the same number of columns after encoding.
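A short sketch of this behaviour, with a made-up `color` column and a simple positional split standing in for a real train/test split:

```python
import pandas as pd

# Hypothetical toy data; convert to categorical before splitting.
dataframe = pd.DataFrame({"color": ["red", "green", "blue", "red", "green"]})
dataframe["color"] = dataframe["color"].astype("category")

# Positional split used only for illustration: "blue" ends up in train only.
train = dataframe.iloc[:3]
test = dataframe.iloc[3:]

# Both halves still carry the full category list, so get_dummies
# emits one column per category in each of them.
train_encoded = pd.get_dummies(train)
test_encoded = pd.get_dummies(test)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

The `color_blue` column is present in `test_encoded` as well, filled entirely with zeros/False.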
I'm still exploring this solution myself. Keep in mind that you can manipulate the categories directly if you need to, with something like this:
# set_categories() returns a new Series, so assign the result back
dataframe[column] = dataframe[column].cat.set_categories([.....])
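As an illustration of setting the categories explicitly, the sketch below assumes a hypothetical `known_categories` list that you maintain yourself (for example, taken from the training data) and a made-up `color` column:

```python
import pandas as pd

# Hypothetical list of categories, e.g. collected from the training data.
known_categories = ["blue", "green", "red"]

# Hypothetical test data with a value ("purple") outside that list.
test = pd.DataFrame({"color": ["red", "purple"]})

# set_categories() returns a new Series; values that are not in the
# new category list become NaN.
test["color"] = (
    test["color"].astype("category").cat.set_categories(known_categories)
)

# NaN rows get no dummy set (dummy_na defaults to False), so the
# unseen value is encoded as an all-zero row.
test_encoded = pd.get_dummies(test)

print(list(test_encoded.columns))
print(test_encoded)
```

This keeps the encoded columns fixed to `known_categories` no matter which values show up in the data being encoded.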