Question
I am trying to use a random forest in R to classify some Kaggle data, but I keep getting the following error whenever I try to use the model I have created.
Error in predict.randomForest(fit, newdata = test, type = "class") :
Type of predictors in new data do not match that of the training data
I am totally lost as to the reason for this error, and Google has not been of much help. Any help or insight will be appreciated. The simple code snippet is given below; it's in response to one of the Kaggle problems.
fit = randomForest(as.factor(IsBadBuy) ~ VehicleAge + WheelTypeID + Transmission + WarrantyCost + VehOdo + Auction,
data=training, importance=TRUE, do.trace=100, keep.forest=TRUE)
prediction = predict(fit, newdata=test, type='class')
t = table(observed=test[, 'IsBadBuy'], predict=prediction)
Answer 1:
For an R newbie like me... They are right when they say "The error message means exactly what it says: there is at least one variable in your training data whose type does not match the equivalent variable in your test data."
Do run the following to confirm nothing is obviously different:
str(training)
str(NewData)
That will list the features and types of the training data and the new data. The reason you might still be confused, as I was, is that the data types can appear to match and you still get the error. Most likely a feature/column is listed as a factor in both sets, but the levels are not the same. My new data was much smaller and didn't have all the levels the training data did, which triggers this error. The fix: when you process your new data and convert a column to a factor, pass in all the possible levels. Then the two sets match and things work.
dataframe$ColToFactor <- factor(dataframe$ColToFactor, levels=c("PossibleLvl1", "PossibleLvl2", "PossibleLvl3")) # list every possible level here
That was the deal for me.
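Instead of typing the levels by hand, the levels already stored in the training data can be reused. The following is only a minimal sketch: the data frames `training` and `newdata` and their values are made up for illustration, and `ColToFactor` is the placeholder column name from the line above.

```r
# Toy data (hypothetical values) standing in for the real training and new data.
training <- data.frame(ColToFactor = factor(c("A", "B", "C", "B")))
newdata  <- data.frame(ColToFactor = factor(c("B", "A")))

# The new data only carries two of the three levels.
levels(newdata$ColToFactor)

# Rebuild the factor using the training levels so both columns match.
newdata$ColToFactor <- factor(newdata$ColToFactor,
                              levels = levels(training$ColToFactor))

# TRUE once the levels are aligned.
identical(levels(newdata$ColToFactor), levels(training$ColToFactor))
```

Note that any value in the new data that is not among the training levels becomes NA after this step, so it is worth checking for NAs afterwards.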
Answer 2:
Take a look at this page; it will probably help:
http://gettinggeneticsdone.blogspot.be/2011/02/split-data-frame-into-testing-and.html
It explains how to split a data frame into testing and training sets in R with an elegant function, and how to use it with random forest.
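As a rough illustration of the idea (this is not the function from the linked page; the helper name `split_train_test` and its parameters are hypothetical), a random split in base R could look like this:

```r
# Hypothetical helper: randomly split a data frame into training and test parts.
split_train_test <- function(df, train_fraction = 0.7, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)          # optional, for reproducibility
  n <- nrow(df)
  train_idx <- sample.int(n, size = floor(train_fraction * n))
  test_idx  <- setdiff(seq_len(n), train_idx) # every row not used for training
  list(train = df[train_idx, , drop = FALSE],
       test  = df[test_idx, , drop = FALSE])
}

# Example with the built-in iris data set.
parts <- split_train_test(iris, train_fraction = 0.7, seed = 1)
nrow(parts$train)
nrow(parts$test)
```

Because subsetting a data frame keeps each factor's full set of levels, both parts produced this way share the same factor levels, even if some level happens not to appear in the test rows.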
Answer 3:
This error is mostly due to categorical predictors. Suppose a particular class of a categorical predictor occurs in the training set when the model is trained but does not occur in the test set at prediction time; then this error occurs.
For example, consider a categorical predictor called "salary_level" with three levels: low, medium, and high. All three classes occur at least once in the training set, but in the test set one of them, say "medium", doesn't occur at all. The predict function then treats "salary_level" in the test set as a new or different variable with only two classes, hence the error that the data doesn't match.
You can track this down by inspecting a categorical variable's classes with table(data_name$variable_name) or table(data_name[,columnposition]).
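The check described above can be sketched as follows. The data frames `train` and `test` and their values are invented for illustration; only the "salary_level" example comes from the answer.

```r
# Toy data (hypothetical values): "medium" never occurs in the test set.
train <- data.frame(salary_level = factor(c("low", "medium", "high", "low")))
test  <- data.frame(salary_level = factor(c("low", "high", "high")))

# Counts per class in each set.
table(train$salary_level)
table(test$salary_level)

# Levels known to the training factor but absent from the test factor.
missing_in_test <- setdiff(levels(train$salary_level), levels(test$salary_level))
missing_in_test
```

A non-empty result points at the columns whose levels need to be aligned before calling predict.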
Answer 4:
This is an old post, but I see activity from a few months ago. I came across this problem myself and could not find a solution on the web, so I solved it with a rough workaround.
The reason for the error is described in the other answers. Briefly, if a variable has a different number of factor levels in the training and test datasets, you get this error. In particular, even if the training data has all the levels but the test data does not, you still get the problem (at least I did).
If you have a dataset and want to split it into train and test, it's better to split it so that all the levels are well represented in both the training and test datasets. But if you want to build a predictor that should work on unseen data, it is best to find a solution.
For example, suppose you have a data frame with 3 levels in column "b":
a<-c(1,2,3,1,3,2,4,5)
b<-as.factor(c(1,2,3,2,3,1,1,2))
d<-c(3,2,5,2,4,2,4,4)
dat<-data.frame(a,b,d) # data.frame() keeps b as a factor; cbind() would return a numeric matrix
And suppose your test data has only two levels in column "b":
a<-c(1,2,2,1,3)
b<-as.factor(c(1,2,1,1,2))
d<-c(3,2,5,2,4)
testData<-data.frame(a,b,d)
Then you get the error. In my dirty solution, I add three rows containing all the factor levels to the test data, convert the column back to a factor, and then remove the added rows.
testData[,2]<-as.character(testData[,2]) # First change the factor to character
addition<-testData[1:3,] ## these rows will be added to testData
addition[,2]<-c("1","2","3") ## Change the content to cover the known factor levels
testData<-rbind(addition,testData) ## add the new rows to testData
testData[,2]<-as.factor(testData[,2]) ## And now convert back to factor
## And finally remove the added rows
testData<-testData[4:nrow(testData),]
My scripts are not neat, and neither is the fix. But I do this one step at a time so that it is understandable when I come back later. Maybe somebody can write the same code in a couple of lines.
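One possible shorter version, offered only as a sketch: it uses the same toy values, built here as data frames with data.frame(), and assumes the training data frame `dat` is still available when the test data is prepared.

```r
# Same toy values as in the example, built as data frames.
dat <- data.frame(a = c(1, 2, 3, 1, 3, 2, 4, 5),
                  b = as.factor(c(1, 2, 3, 2, 3, 1, 1, 2)),
                  d = c(3, 2, 5, 2, 4, 2, 4, 4))
testData <- data.frame(a = c(1, 2, 2, 1, 3),
                       b = as.factor(c(1, 2, 1, 1, 2)),
                       d = c(3, 2, 5, 2, 4))

# Give the test factor the training levels directly, with no helper rows.
testData$b <- factor(testData$b, levels = levels(dat$b))

levels(testData$b)  # now includes the level "3" from the training data
nrow(testData)      # row count is unchanged
```

This reaches the same end state as the add-rows-then-remove-rows approach, since only the level set of the factor changes and the observed values stay as they were.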
Source: https://stackoverflow.com/questions/16172998/type-mismatch-error-using-randomforest-in-r