I use caret in R. My final goal is to run a separate preProcess PCA on each of several data frames and then combine the resulting PCA components in a single training run with ridge regression.
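1. When you perform preProcess (pca) within the train function, the PCA is estimated on the training portion of each resampling split, and the held-out samples are projected onto those components before they are predicted.
When this is finished, the final model is built on the full training set with the hyperparameters that had the best average performance on the test sets.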
When you perform preProcess (pca) before the train function, you cause data leakage, because information from your CV test folds is used to estimate the PCA coordinates. This causes optimistic bias during CV and should be avoided.
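To make the difference concrete, here is a minimal base R sketch of leak-free PCA inside cross-validation, which is conceptually what happens when preProcess is handled within train. The 5-fold split, the two retained components, the lm model and the use of Petal.Width as a stand-in outcome are all my own assumptions for illustration; none of it is caret code:

```r
# Leak-free PCA inside cross-validation, base R only (illustrative sketch).
set.seed(42)
x <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length")])
y <- iris$Petal.Width                      # stand-in outcome (assumption)
folds <- sample(rep(1:5, length.out = nrow(x)))

cv_rmse <- sapply(1:5, function(k) {
  train_idx <- folds != k
  # Centering, scaling and rotation are estimated on the training fold only
  pca <- prcomp(x[train_idx, ], center = TRUE, scale. = TRUE)
  train_pcs <- as.data.frame(pca$x[, 1:2])
  # The held-out fold is projected with the training-fold PCA
  test_pcs <- as.data.frame(predict(pca, newdata = x[!train_idx, ])[, 1:2])
  fit <- lm(y ~ ., data = cbind(train_pcs, y = y[train_idx]))
  sqrt(mean((predict(fit, newdata = test_pcs) - y[!train_idx])^2))
})
mean(cv_rmse)
```

Running prcomp once on all rows before the loop would be the leaky variant, since the held-out rows would then influence the rotation.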
2. I am not aware of built-in caret functionality for juggling several data sets like this. I believe it can be achieved with mlr3pipelines; this tutorial is especially handy.
Here is an example of how to split the iris data set into two data sets, apply scaling and PCA to each of them, combine the transformed columns, and fit an rpart model. The number of PCA components retained and one rpart hyperparameter are tuned using random search:
packages:
library(mlr3pipelines)
library(visNetwork)
library(mlr3learners)
library(mlr3tuning)
library(mlr3)
library(paradox)
define a pipeop selector named "slct1":
pos1 <- po("select", id = "slct1")
tell it which columns to select:
pos1$param_set$values$selector <- selector_name(c("Sepal.Length", "Sepal.Width"))
tell it what to do after it takes the features
pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1") -> pr1
define a pipeop selector named "slct2":
pos2 <- po("select", id = "slct2")
tell it which columns to select:
pos2$param_set$values$selector <- selector_name(c("Petal.Length", "Petal.Width"))
tell it what to do after it takes the features
pos2 %>>%
  mlr_pipeops$get("scale", id = "scale2") %>>%
  mlr_pipeops$get("pca", id = "pca2") -> pr2
combine the two outputs:
piper <- gunion(list(pr1, pr2)) %>>%
  mlr_pipeops$get("featureunion")
and pipe them into a learner:
graph <- piper %>>%
  mlr_pipeops$get("learner",
                  learner = mlr_learners$get("classif.rpart"))
Let's check how it looks:
graph$plot(html = TRUE)
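If the graph is hard to picture, the following base R sketch shows roughly what the two preprocessing branches and the feature union compute: a separate scaled PCA per column block, with the scores bound together column-wise. The object names and the sepal_/petal_ prefixes are my own choices, and unlike the pipeline this fits the PCA once on all rows, so it is for illustration only and not something to use inside resampling:

```r
# Separate scaled PCA per column block, then column-wise union (illustrative sketch).
block1 <- iris[, c("Sepal.Length", "Sepal.Width")]
block2 <- iris[, c("Petal.Length", "Petal.Width")]

# prcomp() centers and scales each block, then returns the scores in $x
pcs1 <- prcomp(block1, center = TRUE, scale. = TRUE)$x
pcs2 <- prcomp(block2, center = TRUE, scale. = TRUE)$x

# Prefix the component names so the two blocks do not collide
colnames(pcs1) <- paste0("sepal_", colnames(pcs1))
colnames(pcs2) <- paste0("petal_", colnames(pcs2))

combined <- cbind(as.data.frame(pcs1), as.data.frame(pcs2),
                  Species = iris$Species)
head(combined)
```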
now define how this should be tuned:
glrn <- GraphLearner$new(graph)
10 fold CV:
cv10 <- rsmp("cv", folds = 10)
tune the number of PCA dimensions retained for each data set as well as the complexity parameter of rpart:
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.", lower = 1, upper = 2),
  ParamInt$new("pca2.rank.", lower = 1, upper = 2)
))
define the task and the tuning:
task <- mlr_tasks$get("iris")
instance <- TuningInstance$new(
  task = task,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
Initiate random search:
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
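For intuition about what the random search explores, here is a base R sketch that draws 20 candidate configurations from the same search space. This is only a conceptual illustration and not how mlr3tuning is implemented internally; the data frame and its column names simply mirror the parameter ids used above:

```r
# Draw random candidate configurations from the search space (illustrative sketch).
set.seed(1)
n_evals <- 20
candidates <- data.frame(
  classif.rpart.cp = runif(n_evals, min = 0, max = 1),   # continuous in [0, 1]
  pca1.rank. = sample(1:2, n_evals, replace = TRUE),     # integer in {1, 2}
  pca2.rank. = sample(1:2, n_evals, replace = TRUE)      # integer in {1, 2}
)
head(candidates)
```

Each row corresponds to one configuration that would be evaluated with the 10-fold CV, and the configuration with the lowest average classification error is reported as the result.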
Perhaps this can also be done with tidymodels; however, I have yet to try them.
EDIT: to answer questions in the comments.
In order to fully grasp mlr3 I advise you to read the book as well as tutorials for each of the accessory packages.
In the above example the number of PCA dimensions retained for each of the data sets was tuned jointly with the cp hyperparameter. This was defined in these lines:
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.", lower = 1, upper = 2),
  ParamInt$new("pca2.rank.", lower = 1, upper = 2)
))
So for pca1, the algorithm could pick 1 or 2 PCs to retain (I set it that way since there are only two features in each data set).
If you do not want to tune the number of dimensions in order to optimize performance, you could define the pipeop like this:
pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
In that case you should omit it from the parameter set:
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1)
))
As far as I know, the variance explained cannot currently be tweaked, only the number of retained dimensions for the PCA transformation.
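If you think in terms of variance explained, one workaround is to translate a threshold into a number of components yourself. The base R sketch below does this for all four iris features; the 0.95 threshold and the variable names are my own assumptions, and since the count is derived from the full data, strictly speaking it should be computed on training data only:

```r
# Translate a variance-explained threshold into a number of components (illustrative sketch).
threshold <- 0.95   # assumed target for cumulative variance explained
pca <- prcomp(iris[, c("Sepal.Length", "Sepal.Width",
                       "Petal.Length", "Petal.Width")],
              center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Smallest number of components whose cumulative variance reaches the threshold
n_comp <- which(cumsum(var_explained) >= threshold)[1]
n_comp
```

The resulting count could then be supplied as the fixed rank. value of the corresponding PCA pipeop.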
To change the predict type one can define a learner:
learner <- mlr_pipeops$get("learner",
                           learner = mlr_learners$get("classif.rpart"))
and set the predict type:
learner$learner$predict_type <- "prob"
and then create the graph:
graph <- piper %>>%
  learner
To acquire performance for each hyper parameter combination:
instance$archive(unnest = "params")
To acquire predictions for each hyper parameter combination:
lapply(as.list(instance$archive(unnest = "params")[, "resample_result"])$resample_result,
       function(x) x$predictions())
To acquire predictions for best hyper-parameter combination:
instance$best()$predictions()
If you would like it in the form of a data frame:
do.call(rbind,
        lapply(instance$best()$predictions(),
               function(x) data.frame(x$data$tab,
                                      x$data$prob)))
There are probably some accessory functions that make this easier; I just haven't played with it enough.