Question
I am doing nested cross-validation using the packages mlr and mlrMBO. The inner CV is used for parametrization (i.e. to find the optimal hyperparameters). Since I want to compare the performance of different learners, I conduct a benchmark experiment using mlr's benchmark function. My question is the following: is it possible to compute the permuted feature importance on the already parametrized model/learner? When I call generateFeatureImportanceData on the learner I use in the benchmark experiment, the model is estimated again (ignoring the parametrization learned by sequential optimization). Here is some code on the iris dataset to illustrate my question (no preprocessing, for illustration only).
library(dplyr)
library(mlr)
library(mlrMBO)
library(e1071)
nr_inner_cv <- 3L
nr_outer_cv <- 2L
inner = makeResampleDesc(
  "CV"
  , iters = nr_inner_cv # folds used in tuning / Bayesian optimization
)
learner_knn_base = makeLearner(id = "knn", "classif.knn")
par.set = makeParamSet(
makeIntegerParam("k", lower = 2L, upper = 10L)
)
ctrl <- makeMBOControl(propose.points = 1L)
ctrl <- setMBOControlTermination(ctrl, iters = 10L)
ctrl <- setMBOControlInfill(ctrl, crit = crit.ei, filter.proposed.points = TRUE)
set.seed(500)
tune.ctrl <- makeTuneControlMBO(
mbo.control = ctrl,
mbo.design = generateDesign(n = 10L, par.set = par.set)
)
learner_knn = makeTuneWrapper(learner = learner_knn_base
, resampling = inner
, par.set = par.set
, control = tune.ctrl
, show.info = TRUE
)
learner_nb <- makeLearner(
id = "naiveBayes"
,"classif.naiveBayes"
)
lrns = list(
learner_knn
, learner_nb
)
rdesc = makeResampleDesc("CV", iters = nr_outer_cv)
set.seed(12345)
bmr = mlr::benchmark(lrns, tasks = iris.task, show.info = FALSE,
resamplings = rdesc, models = TRUE, keep.extract = TRUE)
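For illustration, this is roughly the call in question; because learner_knn is the tuning wrapper, mlr trains it from scratch on the task, i.e. the MBO tuning is run again instead of reusing the parametrization learned during the benchmark (nmc = 50L is just a placeholder value):
fi = generateFeatureImportanceData(
  task = iris.task
  , method = "permutation.importance"
  , learner = learner_knn # the tuning wrapper -> model is estimated again
  , nmc = 50L # number of Monte Carlo permutations
)
fi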
Answer 1:
I think this is a general question that comes up quite often: can I do XY on the models fitted within the CV? Short answer: yes, you can, but do you really want to?
Detailed answer
Similar Q's:
mlr: retrieve output of generateFilterValuesData within CV loop
R - mlr: Is there a easy way to get the variable importance of tuned support vector machine models in nested resampling (spatial)?
As @jakob-r's comment indicates, there are two options:
- Either you recreate the model outside the CV and call your desired function on it
- You do it within the CV, on each fitted model of the respective fold, via the extract argument in resample(). See also Q2 linked above.
1) If you want to do this on all models, see 2) below. If you want to do it on the models of certain folds only: Which criteria did you use to select those?
2) is highly computationally intensive, and you might want to question why you want to do this in the first place - i.e. what do you want to do with all the information from each fold's model?
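If you still want to do it, here is a rough, untested sketch that reuses the benchmark result from the question (keep.extract = TRUE makes the fold-wise tuning results retrievable); the ids "iris-example" and "knn.tuned" are mlr's defaults and may need adjusting for your setup:
tune_res = getBMRTuneResults(bmr) # fold-wise tuning results per task/learner
pars_fold1 = tune_res$`iris-example`$knn.tuned[[1]]$x # tuned k of outer fold 1
lrn_fold1 = setHyperPars(learner_knn_base, par.vals = pars_fold1) # recreate the parametrized learner outside the CV
fi_fold1 = generateFeatureImportanceData(
  task = iris.task
  , method = "permutation.importance"
  , learner = lrn_fold1
  , nmc = 50L
)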
In general, I've never seen a study/use case where this has been applied. Everything you do in the CV contributes to estimating a performance value for each fold. You do not want to interact with these models afterwards.
You would rather want to estimate the feature importance once on the non-partitioned dataset (for which you have optimized the hyperpars once beforehand). This applies in the same way to other diagnostic methods for ML models: apply them to your "full dataset", not to each model within the CV.
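A minimal, untested sketch of that recommended route, reusing the objects defined in the question:
# 1. tune the hyperparameters once on the full (non-partitioned) task
tune_res_full = tuneParams(
  learner = learner_knn_base
  , task = iris.task
  , resampling = inner
  , par.set = par.set
  , control = tune.ctrl
)
# 2. fix the tuned hyperparameters on the base learner
learner_knn_final = setHyperPars(learner_knn_base, par.vals = tune_res_full$x)
# 3. compute the permuted feature importance once on the full dataset
fi_full = generateFeatureImportanceData(
  task = iris.task
  , method = "permutation.importance"
  , learner = learner_knn_final
  , nmc = 50L
)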
Source: https://stackoverflow.com/questions/59196835/mlr-how-to-compute-permuted-feature-importance-for-sequential-mbo-parametrized