Question
I have a training data set with 20 columns, all of which are factors, that I have to use to train a model. I have also been given a test data set on which I have to apply my model to make predictions and submit them.
During initial data exploration, out of curiosity, I compared the factor levels in the training and test data, since all the variables are categorical. To my dismay, most of the variables have different levels in the training and test data sets.
For example:
table(train$cap.shape) #training data column levels
   b    c    f    k    x
 196    4 2356  828 2300
table(test$cap.shape) #test data
   b    f    s    x
 256  796   32 1356
Here the test data set has an extra category, s. How can I handle cases like this? The extra category c in the training set has very few instances, so I was thinking of merging it with another level based on its distribution against the dependent variable, but I am stuck on how to handle the extra level in the test set.
More examples:
table(train$odor) #train
   c    f    m    n    p    s    y
 189 2155   36 2150    2  576  576
table(test$odor) #test
   a    c    f    l    n    p
 400    3    5  400 1378  254
In this column the test set has two extra levels (a and l), each with a substantial number of instances. How can I handle these discrepancies?
table(train$sColour) #train
   b    h    k    n    o    r    w    y
  48 1627  700  753   48   72 2388   48
table(test$sColour) #test
   h    k    n    u
   5 1172 1215   48
Here the test set has the extra level u.
Should I first build a model on the training set alone, find the important predictors, and only then worry about the factor levels?
Answer 1:
Having different feature sets violates a basic precept of machine learning. The training and test data must represent the same data space. These do not; although each pair has a common kernel of features (dimensions), to use them on the same model, you would have to reduce each set to only the common features, or extend both to the union of the features, filling in "don't care" or semantically null values for the extra features.
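As a rough sketch of what that could look like in R: the toy data frames, the shape_aligned and shape_collapsed columns, the collapse_to_other helper, and the count threshold of 3 are all made up for illustration and are not taken from the question's data. The first part extends both columns to the union of levels. The second part is a variant of the "semantically null" idea, folding every level that is rare or absent in training into a single "other" bucket.

```r
# Toy data loosely mirroring cap.shape from the question (counts are invented).
train <- data.frame(cap.shape = factor(rep(c("b", "c", "f", "k", "x"),
                                           times = c(5, 1, 8, 6, 8))))
test  <- data.frame(cap.shape = factor(rep(c("b", "f", "s", "x"),
                                           times = c(4, 6, 2, 7))))

# Option 1: give both columns the same level set (the union of levels).
all_levels <- union(levels(train$cap.shape), levels(test$cap.shape))
train$shape_aligned <- factor(train$cap.shape, levels = all_levels)
test$shape_aligned  <- factor(test$cap.shape,  levels = all_levels)

# Levels that occur in test but were never seen in training.
unseen <- setdiff(levels(test$cap.shape), levels(train$cap.shape))

# Option 2 (hypothetical helper): keep only the levels in `keep`
# and map everything else to a single catch-all level.
collapse_to_other <- function(x, keep, other = "other") {
  x <- as.character(x)
  x[!(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

# Keep levels with at least 3 training rows (arbitrary threshold).
counts <- table(train$cap.shape)
keep   <- names(counts)[as.vector(counts) >= 3]
train$shape_collapsed <- collapse_to_other(train$cap.shape, keep)
test$shape_collapsed  <- collapse_to_other(test$cap.shape,  keep)

print(table(train$shape_collapsed))
print(table(test$shape_collapsed))
```

Aligning the levels only makes the two encodings compatible. With option 1 the model still has no training rows for s, so whatever it predicts for those test rows is not backed by data. With option 2 the rare training level c supplies at least a few "other" rows to learn from.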
Source: https://stackoverflow.com/questions/40536257/handling-different-factor-levels-in-train-and-test-data