R caret: How do I apply separate pca to different dataframes before training?

I use caret in R. My final goal is to submit different dataframes to separate preProcess pca and then put the PCA-components together in one training with ridge regression.

1 answer

甜味超标

2021-01-05 19:13
1. When you perform preProcess (pca) within the train function:
- PCA is fitted on each training set during CV, and that training set is transformed
- several ridge regression models (one per candidate in the defined hyperparameter search) are fitted on each of these transformed training sets
- the matching test set (the held-out fold) is transformed using the PCA fitted on its training set
- all of the fitted models are evaluated on their transformed test sets
When this is finished, the final model is built with the hyperparameters that had the best average performance on the test sets:
- PCA is fitted on the whole training data, and the transformed training data is obtained
- a ridge regression model is fitted on the transformed training data using the chosen hyperparameters
When you perform preProcess (pca) before the train function, you cause data leakage, because information from your CV test folds is used to estimate the PCA coordinates. This causes optimistic bias during CV and should be avoided.
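The per-fold procedure described above can be sketched in base R. This is only an illustration of the idea, not what caret runs internally: the fold assignment, the fixed penalty `lambda`, the fixed `n_comp` and the closed-form ridge fit are all assumptions made for the example.

```r
# Sketch: PCA is re-estimated on every training fold, so the held-out fold
# never influences the PCA rotation (no leakage).
set.seed(42)
X <- as.matrix(iris[, c("Sepal.Width", "Petal.Length", "Petal.Width")])
y <- iris$Sepal.Length

k <- 5                                   # number of CV folds (assumed)
folds <- sample(rep(seq_len(k), length.out = nrow(X)))
n_comp <- 2                              # components kept (assumed, not tuned)
lambda <- 0.1                            # ridge penalty (assumed, not tuned)

fold_rmse <- vapply(seq_len(k), function(i) {
  is_train <- folds != i

  # Fit centring, scaling and PCA on the training fold only
  pca <- prcomp(X[is_train, ], center = TRUE, scale. = TRUE)
  z_train <- pca$x[, seq_len(n_comp), drop = FALSE]

  # Project the held-out fold with the training-fold PCA
  z_test <- predict(pca, newdata = X[!is_train, , drop = FALSE])
  z_test <- z_test[, seq_len(n_comp), drop = FALSE]

  # Closed-form ridge on the scores; the scores are centred, so the
  # intercept is the training mean of y
  y_mean <- mean(y[is_train])
  beta <- solve(crossprod(z_train) + lambda * diag(n_comp),
                crossprod(z_train, y[is_train] - y_mean))

  pred <- y_mean + drop(z_test %*% beta)
  sqrt(mean((y[!is_train] - pred)^2))
}, numeric(1))

mean(fold_rmse)  # CV estimate of the RMSE
```

Running `prcomp()` once on all of `X` before the loop would be the leaky variant that the paragraph above warns against.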

2. I am not aware of built-in caret functionality for juggling several data sets like this. I believe it can be achieved with mlr3pipelines; this tutorial is especially handy.

Here is an example of how to split the iris data set into two data sets, apply scaling and PCA to each of them, combine the transformed columns and fit an rpart model. The number of PCA components retained for each data set, as well as one rpart hyperparameter, is tuned using random search:

packages:
```
library(mlr3pipelines)
library(visNetwork)
library(mlr3learners)
library(mlr3tuning)
library(mlr3)  
library(paradox)
```
define a pipeop selector named "slct1":
```
pos1 <- po("select", id = "slct1")
```
tell it which columns to select:
```
pos1$param_set$values$selector <- selector_name(c("Sepal.Length", "Sepal.Width"))
```
tell it what to do after it takes the features (scale them, then run PCA):
```
pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1") -> pr1
```
define a pipeop selector named "slct2":
```
pos2 <- po("select", id = "slct2")
```
tell it which columns to select:
```
pos2$param_set$values$selector <- selector_name(c("Petal.Length", "Petal.Width"))
```
tell it what to do after it takes the features (scale them, then run PCA):
```
pos2 %>>%
   mlr_pipeops$get("scale", id = "scale2") %>>%
   mlr_pipeops$get("pca", id = "pca2") -> pr2
```
combine the two outputs:
```
piper <- gunion(list(pr1, pr2)) %>>%
  mlr_pipeops$get("featureunion")
```
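For intuition only, the two branches and the feature union roughly correspond to the base R steps below. The helper `pca_block()` and the `sepal`/`petal` column prefixes are made up for this sketch and are not part of mlr3pipelines. Note also that this fits the PCA on all rows at once, whereas the pipeline refits it on each training split during resampling.

```r
# Hypothetical helper: scale one block of columns, run PCA on it, and
# prefix the score columns so the blocks stay distinguishable.
pca_block <- function(df, prefix) {
  pca <- prcomp(df, center = TRUE, scale. = TRUE)
  scores <- pca$x
  colnames(scores) <- paste(prefix, colnames(scores), sep = ".")
  as.data.frame(scores)
}

sepal_pcs <- pca_block(iris[, c("Sepal.Length", "Sepal.Width")], "sepal")
petal_pcs <- pca_block(iris[, c("Petal.Length", "Petal.Width")], "petal")

# "Feature union": bind the two sets of components column-wise and
# attach the target again
combined <- cbind(sepal_pcs, petal_pcs, Species = iris$Species)
str(combined)
```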
and pipe them into a learner:
```
graph <- piper %>>%
  mlr_pipeops$get("learner",
                  learner = mlr_learners$get("classif.rpart"))
```
let's check how it looks:
```
graph$plot(html = TRUE)
```
now define how this should be tuned:
```
glrn <- GraphLearner$new(graph)
```
10 fold CV:
```
cv10 <- rsmp("cv", folds = 10)
```
tune the number of PCA dimensions retained for each data set as well as the complexity parameter of rpart:
```
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.",  lower = 1, upper = 2),
  ParamInt$new("pca2.rank.",  lower = 1, upper = 2)
))
```
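define the task and the tuning (`TuningInstance` and `term()` are the names from the mlr3tuning release this example was written against; later releases renamed them, for example `term()` became `trm()`):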
```
task <- mlr_tasks$get("iris")

instance <- TuningInstance$new(
  task = task,
  learner = glrn,
  resampling = cv10,
  measures = msr("classif.ce"),
  param_set = ps,
  terminator = term("evals", n_evals = 20)
)
```
Initiate random search:
```
tuner <- TunerRandomSearch$new()
tuner$tune(instance)
instance$result
```
Perhaps this can also be done with tidymodels; however, I have yet to try them.

EDIT: to answer questions in the comments.

In order to fully grasp mlr3 I advise you to read the book as well as tutorials for each of the accessory packages.

In the above example, the number of PCA dimensions retained for each of the data sets was tuned jointly with the cp hyperparameter. This was defined in these lines:
```
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1),
  ParamInt$new("pca1.rank.",  lower = 1, upper = 2),
  ParamInt$new("pca2.rank.",  lower = 1, upper = 2)
)) 
```
So for pca1, the algorithm could pick 1 or 2 PCs to retain (I set it that way since there are only two features in each data set).

If you do not want to tune the number of dimensions to optimize performance, you could fix it when defining the pipeop, like this:
```
pos1 %>>%
  mlr_pipeops$get("scale", id = "scale1") %>>%
  mlr_pipeops$get("pca", id = "pca1", param_vals = list(rank. = 1)) -> pr1
```
In that case you should omit it from the parameter set:
```
ps <- ParamSet$new(list(
  ParamDbl$new("classif.rpart.cp", lower = 0, upper = 1)
))
```
As far as I know, the variance explained cannot currently be tweaked, only the number of dimensions retained by the PCA transformation.
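One possible workaround, sketched in base R rather than as an mlr3pipelines feature, is to work out how many components reach a target share of variance and pass that number as `rank.`. The helper `rank_for_variance()` and the 0.95 threshold are assumptions for illustration. To avoid the leakage discussed earlier, such a calculation should be run on training data only, not on the full data set as in this toy example.

```r
# Hypothetical helper: smallest number of principal components whose
# cumulative share of variance reaches `threshold`.
rank_for_variance <- function(df, threshold = 0.95) {
  pca <- prcomp(df, center = TRUE, scale. = TRUE)
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_var >= threshold)[1]
}

rank_sepal <- rank_for_variance(iris[, c("Sepal.Length", "Sepal.Width")])
rank_petal <- rank_for_variance(iris[, c("Petal.Length", "Petal.Width")])

# These values could then be supplied as fixed `rank.` settings for
# the pca1 and pca2 pipeops
c(sepal = rank_sepal, petal = rank_petal)
```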

To change the predict type one can define a learner:
```
learner <- mlr_pipeops$get("learner",
                            learner = mlr_learners$get("classif.rpart"))
```
and set the predict type:
```
learner$learner$predict_type <- "prob"
```
and then create the graph:
```
graph <- piper %>>%
  learner
```
To acquire the performance for each hyperparameter combination:
```
instance$archive(unnest = "params")
```
To acquire the predictions for each hyperparameter combination:
```
lapply(as.list(instance$archive(unnest = "params")[,"resample_result"])$resample_result,
       function(x) x$predictions())
```
To acquire the predictions for the best hyperparameter combination:
```
instance$best()$predictions()
```
If you would like it in the form of a data frame:
```
do.call(rbind,
        lapply(instance$best()$predictions(),
               function(x) data.frame(x$data$tab,
                                      x$data$prob)))
```
There are probably some accessory functions that make this easier; I just haven't played with it enough.