Question
I recently started looking into the caret package for a model I'm developing. I'm using the latest version. As a first step, I decided to use it for feature selection. The data I'm using has about 760 features and 10k observations. I wrote a simple function based on the training material online. Unfortunately, I consistently get an error, so the process never finishes. The code that produces the error is below. In this example I am using a small subset of the features, but I started with the full set. I've also changed the subsets, the number of folds, and the number of repeats, to no avail. I know it will be hard to track down the issue without the data, so I have shared a small subset of it (as an R object, in the format read by the code below).
It always produces this error:
Error in { : task 1 failed - "replacement has length zero"
caretFeatureSelection <- function() {
  library(caret)
  library(mlbench)
  library(Hmisc)
  set.seed(10)

  # Candidate predictors (a small subset of the ~760 available features)
  lr.features = c("f2", "f271", "f527", "f528", "f404", "f376", "f67", "f670", "f281", "f333", "f13", "f282", "f599",
                  "f597", "f68", "f629", "f378", "f230", "f229", "f273", "f768", "f406", "f630",
                  "f596", "f598", "f413", "f412", "f332", "f377", "f766", "f767", "f775", "f10", "f442")

  # Keep only rows with a positive loss and rescale the outcome
  trainDF <- readRDS(file = 'trainDF.rds')
  trainDF <- trainDF[trainDF$loss > 0, ]
  trainDF$lossProb <- trainDF$loss / 100
  y <- trainDF[, 'lossProb']
  x <- trainDF[, names(trainDF) %in% lr.features]
  rm(trainDF)

  # Recursive feature elimination with 5-fold CV, repeated once
  subsets <- c(1:5, 10, 15, 20, 25)
  ctrl <- rfeControl(functions = lrFuncs,
                     method = "repeatedcv",
                     repeats = 1,
                     number = 5)
  lrProfile <- rfe(x, y,
                   sizes = subsets,
                   rfeControl = ctrl)
  lrProfile
}
Answer 1:
So looking at the data, there are three reasons for the failure. First,
> str(x)
'data.frame': 100 obs. of 34 variables:
$ f2 : Factor w/ 10 levels "1","2","3","4",..: 8 8 8 8 9 8 9 9 7 8 ...
<snip>
rfe fits an lm model to these data and generates 39 coefficients even though the data frame x has 34 columns. As a result, rfe gets... confused. Try using model.matrix to convert the factor to dummy variables before running rfe:
x2 <- model.matrix(~., data = x)[,-1] ## the -1 removes the intercept column
... but...
> table(x$f2)
 1  2  3  4  6  7  8  9 10 11
 0  0  0  2  2  5 32 36 23  0
so model.matrix will generate some zero-variance predictors (which is an issue). You could make a new factor with new levels that excludes the empty levels, but keep in mind that any resampling on these data will coerce some of the factor levels (e.g. "4", "6") into zero-variance predictors.
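To make the empty-level problem concrete, here is a minimal base R sketch. The toy data frame and its column z are invented for illustration; only the level set of f2 mirrors the table above. It assumes droplevels() is an acceptable way to "make a new factor" without the unused levels.

```r
# Toy factor mimicking f2: several declared levels never occur in the data.
f2 <- factor(c("4", "6", "7", "8", "8", "9", "9", "10"),
             levels = c("1", "2", "3", "4", "6", "7", "8", "9", "10", "11"))
toy <- data.frame(f2 = f2, z = seq_along(f2))  # z is a hypothetical numeric column

# With unused levels kept, model.matrix creates a dummy column for every
# non-reference level, including levels with no observations (all zeros).
mm_full <- model.matrix(~ ., data = toy)[, -1]
zero_var_full <- colnames(mm_full)[apply(mm_full, 2, var) == 0]

# droplevels() removes the unobserved levels before the expansion.
toy_clean <- droplevels(toy)
mm_clean <- model.matrix(~ ., data = toy_clean)[, -1]
zero_var_clean <- colnames(mm_clean)[apply(mm_clean, 2, var) == 0]

print(zero_var_full)   # dummy columns for the empty levels
print(zero_var_clean)  # none left after dropping unused levels
```

This only handles levels that are empty in the full data set. Rare levels such as "4" and "6" can still be absent from an individual resample, so their dummy columns may become zero-variance inside cross-validation.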
Secondly, there is perfect correlation between some predictors:
> cor(x$f597, x$f599)
     [,1]
[1,]    1
This will cause NA values for some of the model coefficients, which leads to missing variable importances and will tank rfe.
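The effect of a perfectly correlated pair on a linear model can be reproduced with simulated data. The sketch below uses made-up variables (a, b, c1), not the poster's features, and plain lm rather than rfe:

```r
set.seed(1)
n  <- 50
a  <- rnorm(n)
b  <- 2 * a + 3          # exact linear function of a, so cor(a, b) is 1
c1 <- rnorm(n)
y  <- 1 + a + 0.5 * c1 + rnorm(n, sd = 0.1)

# lm cannot separate a from b, so the later of the two gets no estimate.
fit <- lm(y ~ a + b + c1)
print(coef(fit))

# Names of the coefficients that could not be estimated.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)
```

Any importance score computed from those coefficients is then missing for the aliased predictor, which is the situation described above.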
Unless you are using trees or some other model that is tolerant to sparse and/or correlated predictors, a possible workflow prior to rfe could be:
> x2 <- model.matrix(~., data = x)[,-1]
>
> nzv <- nearZeroVar(x2)
> x3 <- x2[, -nzv]
>
> corr_mat <- cor(x3)
> too_high <- findCorrelation(corr_mat, cutoff = .9)
> x4 <- x3[, -too_high]
>
> c(ncol(x2), ncol(x3), ncol(x4))
[1] 42 37 27
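One caveat when reusing this workflow on other data: if a filter finds nothing to remove, the index vector is empty, and negative indexing with an empty vector drops every column instead of none. As far as I know, nearZeroVar and findCorrelation both return an empty integer vector in that case, but treat that as an assumption and check on your data. The helper drop_cols below is hypothetical, not part of caret; the sketch uses base R only.

```r
m <- matrix(1:6, nrow = 2, dimnames = list(NULL, c("a", "b", "c")))
none <- integer(0)       # what a filter might return when nothing is flagged

# -integer(0) is still integer(0), so this selects zero columns.
print(ncol(m[, -none, drop = FALSE]))

# Hypothetical helper: only apply the negative index when there is
# something to drop, and keep the matrix shape with drop = FALSE.
drop_cols <- function(mat, idx) {
  if (length(idx) > 0) mat[, -idx, drop = FALSE] else mat
}

print(colnames(drop_cols(m, none)))  # all columns kept
print(colnames(drop_cols(m, 2L)))    # column "b" removed
```

With such a guard, the filtering steps would read x3 <- drop_cols(x2, nzv) and x4 <- drop_cols(x3, too_high).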
Lastly, by the looks of y you want to predict a number, but lrFuncs is for logistic regression, so I assume it was a typo for lmFuncs. If that is the case, rfe works fine:
> subsets <- c(1:5, 10, 15, 20, 25)
> ctrl <- rfeControl(functions = lmFuncs,
+ method = "repeatedcv",
+ repeats = 1,
+ number=5)
> set.seed(1)
> lrProfile <- rfe(as.data.frame(x4), y,
+ sizes = subsets,
+ rfeControl = ctrl)
Max
Source: https://stackoverflow.com/questions/22129561/r-caret-package-rfe-never-finishes-error-task-1-failed-replacement-has-length