Question
I'm building a model in R and excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I train on 'train' and predict on 'test':
> model <- randomForest::randomForest(tc ~ . - office, data=train, importance=TRUE,proximity=TRUE )
> prediction <- predict(model, test, type = "class")
The prediction comes out as all NAs:
> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
The reason seems to be that test$office contains NAs:
> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005
I can work around the problem by overwriting the NAs with a dummy value:
> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
3 5 10 12 14 18
2921 2752 2921 2752 2921 2752
Levels: 2668 2752 2921 3005
I can also avoid the problem by explicitly removing the 'office' column from the training data, rather than from the formula:
> model <- randomForest::randomForest(tc ~ ., data=train[,!(names(train) %in% c('office'))], importance=TRUE,proximity=TRUE )
> prediction <- predict(model, test, type = "class")
> head(prediction)
3 5 10 12 14 18
3005 2752 3005 2752 2921 2752
Levels: 2668 2752 2921 3005
My questions:
What is the reason for this behavior?
Wasn't the formula tc ~ . - office meant to exclude 'office' from the model?
Is there an elegant solution here?
EDIT:
user agenis asked for the result of str(test); I masked some of the field names:
str(test)
'data.frame': 792 obs. of 15 variables:
$ XXX : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
$ XXX : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
$ XXX : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
$ XXX : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
$ XXX : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
$ XXX : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
$ XXX : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
$ XXX : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
$ XXX : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
$ XXX : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
$ XXX : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
$ tc : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
$ office : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
Shay
Answer 1:
When you use:
FIT <- glm(tc~., data = train)
every variable except tc (the response variable) is used as an explanatory variable.
By contrast, when you run
FIT <- glm(tc~. - office, data = train)
every variable except tc (the response variable) and office is used as an explanatory variable.
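To make the effect of the minus sign concrete, here is a minimal base-R sketch that compares the terms of the two formulas. It uses the built-in mtcars data as a stand-in (an assumption for illustration, not the asker's data), with mpg playing the role of tc and am the role of office.

```r
# Illustration on the built-in mtcars data, not on the asker's train/test sets.
full    <- terms(mpg ~ ., data = mtcars)       # '.' expands to every other column
reduced <- terms(mpg ~ . - am, data = mtcars)  # same, with 'am' subtracted

full_labels    <- attr(full, "term.labels")
reduced_labels <- attr(reduced, "term.labels")

# The only term that differs between the two formulas is the subtracted one.
print(setdiff(full_labels, reduced_labels))

# The response never appears among the explanatory terms.
print("mpg" %in% full_labels)
```

This only shows which terms end up in the model; it says nothing yet about how missing values in the subtracted column are handled.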
Answer 2:
For some reason, the randomForest
function first checks for missing values in the whole data before looking at what's inside your formula.
It returns an error if there are NAs in any column, including the excluded one:
Error in na.fail.default(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, : missing values in object
If there are no missing observations, the formula you specified is correct and will not use the column specified with the minus sign.
Two possibilities then:
- Specify the argument na.action=na.pass to bypass the initial NA check; the algorithm will then run without error. This argument literally means "take no action", so the NAs are kept as they are. It is different from na.exclude, which would remove the entire rows (which you don't want, because the other variables in those rows are non-missing).
- Manually pre-process the data to remove either the rows with missing values or the entire column.
Code example:
df <- mtcars
df[2:10, 'am'] <- NA  # put some NAs into the column we want to exclude
fit <- randomForest::randomForest(mpg ~ . - am, df, na.action = na.pass)
fit$importance  # check that the 'am' variable is absent:
#### IncNodePurity
#### cyl 169.05853
#### disp 267.94975
#### hp 167.03634
#### drat 66.45550
#### wt 276.21383
#### qsec 25.33688
#### vs 30.48513
#### gear 15.39151
#### carb 24.60022
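The NA check can be reproduced with base R alone. The sketch below assumes, as the na.fail error above suggests, that the formula interface builds its data through model.frame(). It is an illustration on mtcars, not a description of randomForest's internals.

```r
df <- mtcars
df[2:10, "am"] <- NA  # same setup as the example above

# With na.pass nothing is dropped, and the model frame still carries 'am',
# even though '- am' removed it from the model terms.
mf_pass <- model.frame(mpg ~ . - am, data = df, na.action = na.pass)
print("am" %in% names(mf_pass))

# An NA-dropping action therefore discards rows because of the excluded column.
mf_omit <- model.frame(mpg ~ . - am, data = df, na.action = na.omit)
print(c(kept_with_pass = nrow(mf_pass), kept_with_omit = nrow(mf_omit)))

# Dropping the column from the data itself keeps every row.
df_no_am <- df[, setdiff(names(df), "am")]
mf_drop  <- model.frame(mpg ~ ., data = df_no_am, na.action = na.omit)
print(nrow(mf_drop))
```

If the same mechanism applies at prediction time, it would explain why removing the column from the data, rather than from the formula, gives non-NA predictions.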
Source: https://stackoverflow.com/questions/46464765/r-variable-exclusion-from-formula-not-working-in-presence-of-missing-data