Handling different Factor Levels in Train and Test data

Question

I have a training data set with 20 columns, all of which are factors that I have to use to train a model. I have also been given a test data set, on which I have to apply my model and submit predictions.

During initial data exploration, since all the variables are categorical, I compared the factor levels in the training and test data out of curiosity. To my dismay, most of the variables have different levels in the two data sets.

For example:

table(train$cap.shape) #training data column levels
  b    c    f    k    x 
196    4 2356  828 2300

table(test$cap.shape) #test data 

 b    f    s    x 
256  796   32 1356

Here the test data set has an extra category, s. How can I handle such cases? The extra category c in the training set has very few instances, so I was thinking of merging it with another level based on its distribution with respect to the dependent variable, but I am stuck on how to handle the extra level in the test set.

More examples

table(train$odor) #train
  c    f    m    n    p    s    y 
 189 2155   36 2150    2  576  576

table(test$odor) #test

  a    c    f    l    n    p 
400    3    5  400 1378  254

In this column the test set has two extra levels (a and l), each with a substantial number of instances. How can I handle these discrepancies?

table(train$sColour) #train
    b    h    k    n    o    r    w    y 
   48 1627  700  753   48   72 2388   48

table(test$sColour) #test
    h    k    n    u 
    5 1172 1215   48

Here the test set has an extra level, u.

Should I first build a model on the training set alone, find the important predictors, and only then worry about the factor levels?

Answer 1:

Having different feature sets violates a basic precept of machine learning: the training and test data must represent the same data space. These do not. Each pair of columns shares a common kernel of levels (which act as features, or dimensions), but to use both data sets with the same model you would have to either reduce each set to only the common levels, or extend both to the union of the levels, filling in "don't care" or semantically null values for the extra ones.
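As a rough illustration of the two options above, here is a minimal R sketch. The toy vectors only mimic the cap.shape levels from the question (the values are made up), and the helper collapse_to_common and its "other" bucket are hypothetical names introduced here, not part of any package. Collapsing unseen levels into a catch-all level is just one possible reading of "reduce to the common levels".

```r
# Toy columns loosely mirroring cap.shape in the question (made-up values).
train_raw <- factor(c("b", "c", "f", "k", "x", "x"))
test_raw  <- factor(c("b", "f", "s", "x"))

# Option 1: extend both factors to the union of their levels.
# Levels missing from one set simply get a count of zero there.
all_levels  <- union(levels(train_raw), levels(test_raw))
train_union <- factor(train_raw, levels = all_levels)
test_union  <- factor(test_raw,  levels = all_levels)

# Option 2: keep only the levels common to both sets and send
# everything else to a catch-all level (hypothetical helper).
collapse_to_common <- function(x, keep, other = "other") {
  x <- as.character(x)
  x[!(x %in% keep)] <- other
  factor(x, levels = c(keep, other))
}

common       <- intersect(levels(train_raw), levels(test_raw))
train_common <- collapse_to_common(train_raw, common)
test_common  <- collapse_to_common(test_raw,  common)

# Both versions now share one level set across train and test.
print(table(train_union))
print(table(test_union))
print(table(train_common))
print(table(test_common))
```

Either way, the point is that both data frames end up with identical level sets before fitting and predicting. Note that option 1 only makes the encodings compatible: a model trained on this data has still never seen an observation with level s, so its predictions for those rows rest on no evidence.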

Source: https://stackoverflow.com/questions/40536257/handling-different-factor-levels-in-train-and-test-data

Tags

machine-learning

classification

random-forest

categorical-data