Question
I'd like to use PipeOps to train a learner on three alternative transformations of a dataset:
- No transformation.
- Class balancing: down-sampling.
- Class balancing: up-sampling.
Then, I'd like to benchmark the three learned models.
My idea was to set up the pipeline as follows:
- Make pipeline: Input -> Impute dataset (optional) -> Branch -> Split into the three branches described above -> Add the learner within each branch -> Unbranch.
- Train the pipeline and hope (that's where I'm getting it wrong) that there will be a result saved for each learner within each branch.
Unfortunately, following these steps results in a single learner that seems to have 'merged' everything from the different branches. I was hoping to get a list of length 3, but I get a list of length one instead.
R code:
library(data.table)
library(paradox)
library(mlr3)
library(mlr3filters)
library(mlr3learners)
library(mlr3misc)
library(mlr3pipelines)
library(mlr3tuning)
library(mlr3viz)
learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
  cp = 0,
  maxdepth = 21,
  minbucket = 12,
  minsplit = 24
)
graph <-
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "null"),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor'),
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major')
  )) %>>%
  gunion(list(
    po("learner", learner, id = "learner_null"),
    po("learner", learner, id = "learner_classbalancing_down"),
    po("learner", learner, id = "learner_classbalancing_up")
  )) %>>%
  po("unbranch")
plot(graph)
tr <- mlr3::resample(tsk("iris"), graph, rsmp("holdout"))
tr$learners
Question 1: How can I get three different results instead?
Question 2: How can I benchmark these three results within the pipeline, following unbranching?
Question 3: What if I want to add multiple learners within each branch? I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned with AutoTuner within each branch. Then, I'd like to benchmark them within each branch and select the 'best' from each branch. Finally, I'd like to benchmark the three best learners to end up with the single best.
Many thanks.
Answer 1:
I think that I've found the answer to what I'm looking for. In brief, what I'd like to do is:
Create a graph pipeline with multiple learners. I'd like some of the learners to be inserted with fixed hyperparameters, while for others I'd like to have their hyperparameters tuned. Then, I'd like to benchmark them and select the 'best' one. I'd also like the benchmarking of learners to happen under different class balancing strategies, namely, do nothing, up-sample and down-sample. The optimal parameter settings for the up/down-sampling (e.g. ratio) would also be determined during tuning.
Two examples below, one that almost does what I want, the other doing exactly what I want.
Example 1: Build a pipe that includes all learners, that is, learners with fixed hyperparameters, as well as learners whose hyperparameters require tuning
As will be shown, it seems like a bad idea to have both kinds of learners (i.e. with fixed and tunable hyperparameters), because tuning the pipe disregards the learners with tunable hyperparameters.
####################################################################################
# Build Machine Learning pipeline that:
# 1. Imputes missing values (optional).
# 2. Tunes and benchmarks a range of learners.
# 3. Handles imbalanced data in different ways.
# 4. Identifies optimal learner for the task at hand.
# Abbreviations
# 1. td: Tuned. Learner already tuned with optimal hyperparameters, as found empirically by Probst et al. (2019). See http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
# 2. tn: Tuner. Optimal hyperparameters for the learner to be determined within the Tuner.
# 3. raw: Raw dataset in that class imbalances were not treated in any way.
# 4. up: Data upsampling to balance class imbalances.
# 5. down: Data downsampling to balance class imbalances.
# References
# Probst et al. (2019). http://jmlr.csail.mit.edu/papers/volume20/18-444/18-444.pdf
####################################################################################
library(dplyr)   # for %>%, select, group_by, sample_frac, ungroup
library(tibble)  # for rownames_to_column, deframe

task <- tsk('sonar')
# Indices for splitting data into training and test sets
train.idx <- task$data() %>%
  select(Class) %>%
  rownames_to_column %>%
  group_by(Class) %>%
  sample_frac(2 / 3) %>% # Stratified sample to maintain proportions between classes.
  ungroup %>%
  select(rowname) %>%
  deframe %>%
  as.numeric
test.idx <- setdiff(seq_len(task$nrow), train.idx)
# Define training and test sets in task format
task_train <- task$clone()$filter(train.idx)
task_test <- task$clone()$filter(test.idx)
# Define class balancing strategies
class_counts <- table(task_train$truth())
upsample_ratio <- class_counts[class_counts == max(class_counts)] /
  class_counts[class_counts == min(class_counts)]
downsample_ratio <- 1 / upsample_ratio
# 1. Enrich minority class by factor 'ratio'
po_over <- po("classbalancing", id = "up", adjust = "minor",
reference = "minor", shuffle = FALSE, ratio = upsample_ratio)
# 2. Reduce majority class by factor '1/ratio'
po_under <- po("classbalancing", id = "down", adjust = "major",
reference = "major", shuffle = FALSE, ratio = downsample_ratio)
# 3. No class balancing
po_raw <- po("nop", id = "raw") # Pipe operator for 'do nothing' ('nop'), i.e. don't up/down-balance the classes.
# We will be using an XGBoost learner throughout with different hyperparameter settings.
# Define XGBoost learner with the optimal hyperparameters of Probst et al.
# Learner will be added to the pipeline later on, in conjunction with and without class balancing.
xgb_td <- lrn("classif.xgboost", predict_type = 'prob')
xgb_td$param_set$values <- list(
  booster = "gbtree",
  nrounds = 2563,
  max_depth = 11,
  min_child_weight = 1.75,
  subsample = 0.873,
  eta = 0.052,
  colsample_bytree = 0.713,
  colsample_bylevel = 0.638,
  lambda = 0.101,
  alpha = 0.894
)
xgb_td_raw <- GraphLearner$new(
  po_raw %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)
xgb_tn_raw <- GraphLearner$new(
  po_raw %>>%
    po('learner', lrn("classif.xgboost", predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)
xgb_td_up <- GraphLearner$new(
  po_over %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)
xgb_tn_up <- GraphLearner$new(
  po_over %>>%
    po('learner', lrn("classif.xgboost", predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)
xgb_td_down <- GraphLearner$new(
  po_under %>>%
    po('learner', xgb_td, id = 'xgb_td'),
  predict_type = 'prob'
)
xgb_tn_down <- GraphLearner$new(
  po_under %>>%
    po('learner', lrn("classif.xgboost", predict_type = 'prob'), id = 'xgb_tn'),
  predict_type = 'prob'
)
learners_all <- list(
  xgb_td_raw,
  xgb_tn_raw,
  xgb_td_up,
  xgb_tn_up,
  xgb_td_down,
  xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)
# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
  # po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
  po("branch", names(learners_all)) %>>%
  gunion(unname(learners_all)) %>>%
  po("unbranch")
graph$plot() # Plot pipeline
pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])
# Set hyperparameter ranges for the tunable learners
ps_xgboost <- ps_table$id %>%
  lapply(
    function(x) {
      if (grepl('_tn', x)) {
        if (grepl('.booster', x)) {
          ParamFct$new(x, levels = "gbtree")
        } else if (grepl('.nrounds', x)) {
          ParamInt$new(x, lower = 100, upper = 110)
        } else if (grepl('.max_depth', x)) {
          ParamInt$new(x, lower = 3, upper = 10)
        } else if (grepl('.min_child_weight', x)) {
          ParamDbl$new(x, lower = 0, upper = 10)
        } else if (grepl('.subsample', x)) {
          ParamDbl$new(x, lower = 0, upper = 1)
        } else if (grepl('.eta', x)) {
          ParamDbl$new(x, lower = 0.1, upper = 0.6)
        } else if (grepl('.colsample_bytree', x)) {
          ParamDbl$new(x, lower = 0.5, upper = 1)
        } else if (grepl('.gamma', x)) {
          ParamDbl$new(x, lower = 0, upper = 5)
        }
      }
    }
  )
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)
# Set parameter ranges for the class balancing strategies
ps_class_balancing <- ps_table$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)
# Define parameter set
param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
  ps_xgboost,
  ps_class_balancing
))
# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_tn.xgb_tn.booster" and branch "raw.xgb_tn".
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
  lapply(
    function(x) {
      aux <- names(learners_all) %>%
        sapply(
          function(y) {
            grepl(y, x)
          }
        )
      aux <- names(aux[aux])
      param_set$add_dep(x, "branch.selection", CondEqual$new(aux))
    }
  )
# Set up tuning instance
instance <- TuningInstance$new(
  task = task_train,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr("classif.bbrier"),
  # measures = prc_micro,
  param_set,
  terminator = term("evals", n_evals = 3)
)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find best-performing branch
tuner$tune(instance)
instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values
pipe$param_set$values <- instance$result$params
pipe$train(task_train)
pred <- pipe$predict(task_test)
pred$confusion
Note that the tuner chooses to disregard the tuning of the tunable learners and focuses on the tuned learners only. This can be confirmed by inspecting instance$result: the only things that have been tuned for the tunable learners are the class-balancing parameters, which are actually not learner hyperparameters.
Example 2: Build a pipe that includes tunable learners only, find the 'best' one, and then benchmark it against the learners with fixed hyperparameters at a second stage.
Step 1: Build pipe for tunable learners
learners_all <- list(
  # xgb_td_raw,
  xgb_tn_raw,
  # xgb_td_up,
  xgb_tn_up,
  # xgb_td_down,
  xgb_tn_down
)
names(learners_all) <- sapply(learners_all, function(x) x$id)
# Create pipeline as a graph. This way, pipeline can be plotted. Pipeline can then be converted into a learner with GraphLearner$new(pipeline).
# Pipeline is a collection of Graph Learners (type ?GraphLearner in the command line for info).
# Each GraphLearner is a td or tn model (see abbreviations above) with or without class balancing.
# Up/down or no sampling happens within each GraphLearner, otherwise an error during tuning indicates that there are >= 2 data sources.
# Up/down or no sampling within each GraphLearner can be specified by chaining the relevant pipe operators (function po(); type ?PipeOp in command line) with the PipeOp of each learner.
graph <-
  # po("imputehist") %>>% # Optional. Impute missing values only when using classifiers that can't handle them (e.g. Random Forest).
  po("branch", names(learners_all)) %>>%
  gunion(unname(learners_all)) %>>%
  po("unbranch")
graph$plot() # Plot pipeline
pipe <- GraphLearner$new(graph) # Convert pipeline to learner
pipe$predict_type <- 'prob' # Don't forget to specify we want to predict probabilities and not classes.
ps_table <- as.data.table(pipe$param_set)
View(ps_table[, 1:4])
ps_xgboost <- ps_table$id %>%
  lapply(
    function(x) {
      if (grepl('_tn', x)) {
        if (grepl('.booster', x)) {
          ParamFct$new(x, levels = "gbtree")
        } else if (grepl('.nrounds', x)) {
          ParamInt$new(x, lower = 100, upper = 110)
        } else if (grepl('.max_depth', x)) {
          ParamInt$new(x, lower = 3, upper = 10)
        } else if (grepl('.min_child_weight', x)) {
          ParamDbl$new(x, lower = 0, upper = 10)
        } else if (grepl('.subsample', x)) {
          ParamDbl$new(x, lower = 0, upper = 1)
        } else if (grepl('.eta', x)) {
          ParamDbl$new(x, lower = 0.1, upper = 0.6)
        } else if (grepl('.colsample_bytree', x)) {
          ParamDbl$new(x, lower = 0.5, upper = 1)
        } else if (grepl('.gamma', x)) {
          ParamDbl$new(x, lower = 0, upper = 5)
        }
      }
    }
  )
ps_xgboost <- Filter(Negate(is.null), ps_xgboost)
ps_xgboost <- ParamSet$new(ps_xgboost)
ps_class_balancing <- ps_table$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_class_balancing <- Filter(Negate(is.null), ps_class_balancing)
ps_class_balancing <- ParamSet$new(ps_class_balancing)
param_set <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe$param_set$params$branch.selection$clone())), # ParamFct can be copied.
  ps_xgboost,
  ps_class_balancing
))
# Add dependencies. For instance, we can only set the mtry value if the pipe is configured to use the Random Forest (ranger).
# In a similar manner, we want to add a dependency between, e.g., hyperparameter "raw.xgb_tn.xgb_tn.booster" and branch "raw.xgb_tn".
# See https://mlr3gallery.mlr-org.com/tuning-over-multiple-learners/
param_set$ids()[-1] %>%
  lapply(
    function(x) {
      aux <- names(learners_all) %>%
        sapply(
          function(y) {
            grepl(y, x)
          }
        )
      aux <- names(aux[aux])
      param_set$add_dep(x, "branch.selection", CondEqual$new(aux))
    }
  )
# Set up tuning instance
instance <- TuningInstance$new(
  task = task_train,
  learner = pipe,
  resampling = rsmp('cv', folds = 2),
  measures = msr("classif.bbrier"),
  # measures = prc_micro,
  param_set,
  terminator = term("evals", n_evals = 3)
)
tuner <- TunerRandomSearch$new()
# Tune pipe learner to find best-performing branch
tuner$tune(instance)
instance$result
instance$archive()
instance$archive(unnest = "tune_x") # Unnest the tuner search space values
pipe$param_set$values <- instance$result$params
pipe$train(task_train)
pred <- pipe$predict(task_test)
pred$confusion
Note that now instance$result returns optimal results for the learners' hyperparameters too, and not just for the class-balancing parameters.
Step 2: Benchmark 'best' tunable learner (now tuned) and the learners that have fixed hyperparameters
# Define re-sampling and instantiate it so always the same split will be used
resampling <- rsmp("cv", folds = 2)
set.seed(123)
resampling$instantiate(task_train)
bmr <- benchmark(
  design = benchmark_grid(
    task_train,
    learner = list(pipe, xgb_td_raw, xgb_td_up, xgb_td_down),
    resampling
  ),
  store_models = TRUE # Only needed if you want to inspect the models
)
bmr$aggregate(msr("classif.bbrier"))
A few issues to consider:
- I should probably have created a second, separate pipe for the learners that have fixed hyperparameters, in order to at least have the class-balancing parameters tuned. The two pipes (tunable and fixed hyperparameters) would then be benchmarked with benchmark(); a sketch of such a second pipe follows below.
- I should probably have used the same resampling strategy from beginning to end, i.e. instantiate the resampling strategy right before tuning the first pipe, so that this strategy is also used in the second pipe and in the final benchmark.
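A minimal sketch of such a second pipe, reusing the td learners and helper objects defined above (untested and not part of the original answer; the parameter ids are assumed to follow the same naming pattern as in ps_table):
# Sketch: a pipe with only the fixed-hyperparameter (td) learners, exposing
# just the branch choice and the class-balancing ratios to the tuner.
learners_td <- list(xgb_td_raw, xgb_td_up, xgb_td_down)
names(learners_td) <- sapply(learners_td, function(x) x$id)
graph_td <- po("branch", names(learners_td)) %>>%
  gunion(unname(learners_td)) %>>%
  po("unbranch")
pipe_td <- GraphLearner$new(graph_td)
pipe_td$predict_type <- 'prob'
# Rebuild the ratio parameters from this pipe's own param set, mirroring the
# ps_class_balancing construction above.
ps_table_td <- as.data.table(pipe_td$param_set)
ps_ratio_td <- ps_table_td$id %>%
  lapply(
    function(x) {
      if (all(grepl('up.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = 1, upper = upsample_ratio)
      } else if (all(grepl('down.', x), grepl('.ratio', x))) {
        ParamDbl$new(x, lower = downsample_ratio, upper = 1)
      }
    }
  )
ps_ratio_td <- ParamSet$new(Filter(Negate(is.null), ps_ratio_td))
param_set_td <- ParamSetCollection$new(list(
  ParamSet$new(list(pipe_td$param_set$params$branch.selection$clone())),
  ps_ratio_td
))
# Dependencies on branch.selection could be added as before; pipe_td would then
# be tuned like 'pipe' above and the two pipes compared with benchmark().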
Comments/validation more than welcome.
(special thanks to missuse for the constructive comments)
Answer 2:
The simplest way to benchmark several pipelines is to define the appropriate graphs and use the benchmark function:
library(paradox)
library(mlr3)
library(mlr3pipelines)
library(mlr3tuning)
learner <- lrn("classif.rpart", predict_type = "prob")
learner$param_set$values <- list(
  cp = 0,
  maxdepth = 21,
  minbucket = 12,
  minsplit = 24
)
Create the three graphs.
Graph 1: just imputehist:
graph_nop <- po("imputehist") %>>%
  learner
Graph 2: imputehist and undersample the majority class (ratio relative to the majority class):
graph_down <- po("imputehist") %>>%
  po("classbalancing", id = "undersample", adjust = "major",
     reference = "major", shuffle = FALSE, ratio = 1/2) %>>%
  learner
Graph 3: imputehist and oversample the minority class (ratio relative to the minority class):
graph_up <- po("imputehist") %>>%
  po("classbalancing", id = "oversample", adjust = "minor",
     reference = "minor", shuffle = FALSE, ratio = 2) %>>%
  learner
Convert graphs to learners and set predict_type
graph_nop <- GraphLearner$new(graph_nop)
graph_nop$predict_type <- "prob"
graph_down <- GraphLearner$new(graph_down)
graph_down$predict_type <- "prob"
graph_up <- GraphLearner$new(graph_up)
graph_up$predict_type <- "prob"
Define the resampling and instantiate it so that the same split is always used:
hld <- rsmp("holdout")
set.seed(123)
hld$instantiate(tsk("sonar"))
Benchmark
bmr <- benchmark(
  design = benchmark_grid(
    task = tsk("sonar"),
    learner = list(graph_nop, graph_up, graph_down),
    hld
  ),
  store_models = TRUE # only needed if you want to inspect the models
)
Check the result using different measures:
bmr$aggregate(msr("classif.auc"))
nr resample_result task_id learner_id resampling_id iters classif.auc
1: 1 <ResampleResult> sonar imputehist.classif.rpart holdout 1 0.7694257
2: 2 <ResampleResult> sonar imputehist.oversample.classif.rpart holdout 1 0.7360642
3: 3 <ResampleResult> sonar imputehist.undersample.classif.rpart holdout 1 0.7668919
bmr$aggregate(msr("classif.ce"))
nr resample_result task_id learner_id resampling_id iters classif.ce
1: 1 <ResampleResult> sonar imputehist.classif.rpart holdout 1 0.3043478
2: 2 <ResampleResult> sonar imputehist.oversample.classif.rpart holdout 1 0.3188406
3: 3 <ResampleResult> sonar imputehist.undersample.classif.rpart holdout 1 0.2898551
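Several measures can also be aggregated in one call by passing a list of measures built with msrs():
# Aggregate over several measures at once (msrs() builds a list of measures):
bmr$aggregate(msrs(c("classif.auc", "classif.ce")))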
This can also be performed within one pipeline with branching, but one would need to define the param set and use a tuner:
graph2 <-
  po("imputehist") %>>%
  po("branch", c("nop", "classbalancing_up", "classbalancing_down")) %>>%
  gunion(list(
    po("nop", id = "nop"),
    po("classbalancing", id = "classbalancing_up", ratio = 2, reference = 'major'),
    po("classbalancing", id = "classbalancing_down", ratio = 2, reference = 'minor')
  )) %>>%
  po("unbranch") %>>%
  learner
graph2$plot()
Note that the unbranch happens before the learner, since one (always the same) learner is being used. Convert the graph to a learner and set predict_type:
graph2 <- GraphLearner$new(graph2)
graph2$predict_type <- "prob"
Define the param set. In this case just the different branch options.
ps <- ParamSet$new(list(
  ParamFct$new("branch.selection",
               levels = c("nop", "classbalancing_up", "classbalancing_down"))
))
In general you would also want to add learner hyperparameters, such as cp and minsplit for rpart, as well as the over/undersampling ratios.
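A hedged sketch of such an extended search space (the parameter ids are assumed to follow graph2's PipeOp ids, the bounds are purely illustrative, and the tuning below still uses the branch-only ps):
# Sketch of an extended search space (ids assumed from graph2's PipeOp ids;
# bounds illustrative, not from the original answer):
ps_ext <- ParamSet$new(list(
  ParamFct$new("branch.selection",
               levels = c("nop", "classbalancing_up", "classbalancing_down")),
  ParamDbl$new("classif.rpart.cp", lower = 0.001, upper = 0.1),
  ParamInt$new("classif.rpart.minsplit", lower = 1, upper = 50),
  ParamDbl$new("classbalancing_up.ratio", lower = 1, upper = 4),
  ParamDbl$new("classbalancing_down.ratio", lower = 0.25, upper = 1)
))
# The ratio parameters only make sense on their own branch:
ps_ext$add_dep("classbalancing_up.ratio", "branch.selection",
               CondEqual$new("classbalancing_up"))
ps_ext$add_dep("classbalancing_down.ratio", "branch.selection",
               CondEqual$new("classbalancing_down"))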
Create a tuning instance and a grid search with resolution 1, since only the branch selection (the ps defined above) is tuned here. The tuner will iterate through the different pipeline branches as defined in the param set.
instance <- TuningInstance$new(
  task = tsk("sonar"),
  learner = graph2,
  resampling = hld,
  measures = msr("classif.auc"),
  param_set = ps,
  terminator = term("none")
)
tuner <- tnr("grid_search", resolution = 1)
set.seed(321)
tuner$tune(instance)
Check the result:
instance$archive(unnest = "tune_x")
nr batch_nr resample_result task_id
1: 1 1 <ResampleResult> sonar
2: 2 2 <ResampleResult> sonar
3: 3 3 <ResampleResult> sonar
learner_id resampling_id iters params
1: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
2: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
3: imputehist.branch.null.classbalancing_up.classbalancing_down.unbranch.classif.rpart holdout 1 <list>
warnings errors classif.auc branch.selection
1: 0 0 0.7842061 classbalancing_down
2: 0 0 0.7673142 classbalancing_up
3: 0 0 0.7694257 nop
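The selected branch can then be applied to the graph learner, following the same pattern as in Answer 1 (sketch):
# Apply the best configuration found by the tuner and retrain on the full task
# (mirrors the instance$result$params pattern used in Answer 1):
graph2$param_set$values <- instance$result$params
graph2$train(tsk("sonar"))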
Even though the above example works, I think mlr3pipelines is designed so that you tune learner hyperparameters jointly with the preprocessing steps, while also selecting the best preprocessing steps (via branching).
Question 3 has multiple sub-questions, some of which would take quite a lot of code and explanation to answer. I suggest checking the mlr3book as well as the mlr3gallery.
EDIT: an mlr3 gallery post, https://mlr3gallery.mlr-org.com/posts/2020-03-30-imbalanced-data/, is relevant to this question.
Source: https://stackoverflow.com/questions/61014457/mlr3-pipeops-create-branches-with-different-data-transformations-and-benchmark