Keep same dummy variable in training and testing data

ぐ巨炮叔叔 提交于 2019-12-17 07:11:59

问题


I am building a prediction model in python with two separate training and testing sets. The training data contains numerical type categorical variable, e.g., zip code,[91521,23151,12355, ...], and also string categorical variables, e.g., city ['Chicago', 'New York', 'Los Angeles', ...].

To train the data, I first use the 'pd.get_dummies' to get dummy variable of these variable, and then fit the model with the transformed training data.

I do the same transformation on my test data and predict the result using the trained model. However, I got the error 'ValueError: Number of features of the model must match the input. Model n_features is 1487 and input n_features is 1345 '. The reason is because there are fewer dummy variables in the test data because it has fewer 'city' and 'zipcode'.

How can I solve this problem? For example, 'OneHotEncoder' will only encode all numerical type categorical variable. 'DictVectorizer()' will only encode all string type categorical variable. I search on line and see a few similar questions but none of them really addresses my question.

Handling categorical features using scikit-learn

https://www.quora.com/If-the-training-dataset-has-more-variables-than-the-test-dataset-what-does-one-do

https://www.quora.com/What-is-the-best-way-to-do-a-binary-one-hot-one-of-K-coding-in-Python


回答1:


You can also just get the missing columns and add them to the test dataset:

# Get missing columns in the training test
missing_cols = set( train.columns ) - set( test.columns )
# Add a missing column in test set with default value equal to 0
for c in missing_cols:
    test[c] = 0
# Ensure the order of column in the test set is in the same order than in train set
test = test[train.columns]

This code also ensure that column resulting from category in the test dataset but not present in the training dataset will be removed




回答2:


Assume you have identical feature's names in train and test dataset. You can generate concatenated dataset from train and test, get dummies from concatenated dataset and split it to train and test back.

You can do it this way:

import pandas as pd
train = pd.DataFrame(data = [['a', 123, 'ab'], ['b', 234, 'bc']],
                     columns=['col1', 'col2', 'col3'])
test = pd.DataFrame(data = [['c', 345, 'ab'], ['b', 456, 'ab']],
                     columns=['col1', 'col2', 'col3'])
train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0)
dataset_preprocessed = pd.get_dummies(dataset)
train_preprocessed = dataset_preprocessed[:train_objs_num]
test_preprocessed = dataset_preprocessed[train_objs_num:]

In result, you have equal number of features for train and test dataset.




回答3:


train2,test2 = train.align(test, join='outer', axis=1, fill_value=0)

train2 and test2 have the same columns. Fill_value indicates the value to use for missing columns.




回答4:


This is a rather old question, but if you aim at using scikit learn API, you can use the following DummyEncoder class: https://gist.github.com/psinger/ef4592492dc8edf101130f0bf32f5ff9

What it does is that it utilizes the category dtype to specify which dummies to create as also elaborated here: Dummy creation in pipeline with different levels in train and test set



来源:https://stackoverflow.com/questions/41335718/keep-same-dummy-variable-in-training-and-testing-data

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!