Question

I'm building a model in R while excluding the 'office' column in the formula (it sometimes contains hints of the class I predict). I train on 'train' and predict on 'test':

> model <- randomForest::randomForest(tc ~ . - office, data = train, importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")

The prediction comes back as all NAs:

> head(prediction)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

The reason is that test$office contains NAs:

> head(test$office)
[1] <NA> <NA> <NA> <NA> <NA> <NA>
Levels: 2668 2752 2921 3005

I can fix the problem by overwriting the NAs in 'office' with a constant:

> test2 <- test
> test2$office <- 1
> prediction <- predict(model, test2, type = "class")
> head(prediction)
   3    5   10   12   14   18 
 2921 2752 2921 2752 2921 2752 
Levels: 2668 2752 2921 3005

I can also avoid the problem by explicitly removing the 'office' column from the training data, rather than from the formula:

> model <- randomForest::randomForest(tc ~ ., data = train[, !(names(train) %in% c('office'))], importance = TRUE, proximity = TRUE)
> prediction <- predict(model, test, type = "class")
> head(prediction)
   3    5   10   12   14   18 
3005 2752 3005 2752 2921 2752 
Levels: 2668 2752 2921 3005

My question: what is the reason for this behavior?

Was the formula tc ~ . - office meant to exclude 'office' from the model?

Is there an elegant solution here?

EDIT:

User agenis asked for the result of str(test); I masked some of the field names:

str(test)
'data.frame':   792 obs. of  15 variables:
 $ XXX              : Factor w/ 2 levels "Force","Retry": 1 2 2 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 6 13 6 15 13 15 10 3 5 12 ...
 $ XXX                  : Factor w/ 3 levels "Instructions Info 1",..: 2 2 3 2 2 2 2 3 3 3 ...
 $ XXX                  : Factor w/ 3 levels "Remittance Info 1",..: 3 1 3 1 2 2 1 1 1 1 ...
 $ XXX                  : Factor w/ 3 levels "CRED","DEBT",..: 3 2 1 2 1 2 1 2 2 3 ...
 $ XXX                  : Factor w/ 3 levels "INTC","LOAN",..: 2 2 2 3 1 3 1 1 3 3 ...
 $ XXX                  : Factor w/ 15 levels "25 Westend, Birmingham",..: 3 9 15 14 5 15 10 11 2 7 ...
 $ XXX                  : Factor w/ 2 levels "SDVA","URGP": 1 2 1 1 1 2 2 2 2 1 ...
 $ XXX                  : Factor w/ 3 levels "CNY","EUR","GBP": 1 2 1 1 2 1 2 1 2 3 ...
 $ XXX                  : Factor w/ 19 levels "BNKADE22XXX",..: 3 19 11 11 4 8 8 8 19 3 ...
 $ XXX                  : Factor w/ 4 levels "_NV_E_","CNY",..: 1 3 2 2 3 2 3 2 3 1 ...
 $ XXX                  : Factor w/ 9 levels "BNKADE22XXX",..: 3 9 1 1 4 8 8 8 9 3 ...
 $ tc                   : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...
 $ office               : Factor w/ 4 levels "604","688","698",..: NA NA NA NA NA NA NA NA NA NA ...

Shay

Answer 1:

When you use:

FIT <- glm(tc~., data = train)

you are using all the variables except tc (the response variable) as explanatory variables.

Furthermore, when you run

FIT <- glm(tc~. - office, data = train)

you are using all the variables except tc (the response variable) and office as explanatory variables.
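
To see how this interacts with missing values, here is a minimal base R sketch. It uses mtcars as a stand-in for 'train', with 'am' playing the role of 'office'; the names df, f, tt, mf and mf_omit are illustrative only. It suggests that the minus sign removes the term from the model, while the variable itself still ends up in the model frame, so NA handling can still see it:

```r
# Stand-in data: mtcars with NAs injected into the column we want to exclude.
df <- mtcars
df[2:10, "am"] <- NA

f <- mpg ~ . - am

# The expanded formula has no 'am' term ...
tt <- terms(f, data = df)
term_labels <- attr(tt, "term.labels")
print(term_labels)

# ... but the model frame built from the same formula still carries 'am'.
mf <- model.frame(f, data = df, na.action = na.pass)
print(names(mf))

# With the default-style NA handling, rows with NA in 'am' are dropped,
# even though 'am' is not a predictor.
mf_omit <- model.frame(f, data = df, na.action = na.omit)
print(nrow(mf_omit))
```

This is only the base R formula machinery; how a given modelling function reacts to the NAs depends on the na.action it applies to that model frame.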

Answer 2:

For some reason, the randomForest function first checks for missing values in the whole data before looking at what is inside your formula. It returns an error if there are NAs in any column, whether or not the formula uses it:

Error in na.fail.default(list(mpg = c(21, 21, 22.8, 21.4, 18.7, 18.1, : missing values in object

If there are no missing observations, the formula you specified is correct and will not use the column specified with the minus sign.

Two possibilities then:

Specify the argument na.action=na.pass to bypass the first NA check; the algorithm will then run smoothly without error. This argument literally means "take no action", so the NAs are kept and you see what happens. It is different from na.exclude, which would remove the entire rows (which you don't want, because the other variables in those rows are non-missing).
Pre-process the data manually to remove either the missing values or the entire column.

Code example:

df <- mtcars
df[2:10, 'am'] <- NA
fit <- randomForest::randomForest(mpg ~ . - am, df, na.action = na.pass)
fit$importance # check that the 'am' variable is absent:
####      IncNodePurity
#### cyl      169.05853
#### disp     267.94975
#### hp       167.03634
#### drat      66.45550
#### wt       276.21383
#### qsec      25.33688
#### vs        30.48513
#### gear      15.39151
#### carb      24.60022
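
The second possibility, manual pre-processing, could be sketched like this in base R. It reuses the same mtcars stand-in; na_counts and df_clean are hypothetical names, and dropping the whole column is just one of the two pre-processing choices mentioned above:

```r
# Same stand-in data as above: NAs only in the column to be excluded.
df <- mtcars
df[2:10, "am"] <- NA

# Count missing values per column to see where the NAs live.
na_counts <- colSums(is.na(df))
print(na_counts[na_counts > 0])

# Drop the unwanted column from the data instead of from the formula.
df_clean <- df[, setdiff(names(df), "am")]
print(anyNA(df_clean))
```

With df_clean, a plain mpg ~ . formula no longer refers to 'am' at all, so the NA check has nothing to trip over.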

Source: https://stackoverflow.com/questions/46464765/r-variable-exclusion-from-formula-not-working-in-presence-of-missing-data
