Question
I've seen various posts on how to select the independent variables for a model by using expand.grid and then creating a formula based on that selection. However, I prepare my input tables beforehand and store them in a list.
library(ranger)
data(iris)
Input_list <- list(iris1 = iris, iris2 = iris) # let's assume these are different input tables
I'm rather interested in trying all possible hyperparameter combinations for a given algorithm (here: Random Forest using ranger) for my list of input tables. I do the following to set up the grid:
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  Importance = c("none", "impurity"),
  Classification = TRUE,
  Repeats = 1:5,
  Target = "Species")
> head(hyper_grid)
  Input_table Trees Importance Classification Repeats  Target
1       iris1    10       none           TRUE       1 Species
2       iris2    10       none           TRUE       1 Species
3       iris1    20       none           TRUE       1 Species
4       iris2    20       none           TRUE       1 Species
5       iris1    10   impurity           TRUE       1 Species
6       iris2    10   impurity           TRUE       1 Species
My question is: what is the best way to pass these values to the model? Currently I'm using a for loop:
for (i in seq_len(nrow(hyper_grid))) {
  RF_train <- ranger(
    dependent.variable.name = hyper_grid[i, "Target"],
    data = Input_list[[hyper_grid[i, "Input_table"]]], # referring to the named object in the list
    num.trees = hyper_grid[i, "Trees"],
    importance = hyper_grid[i, "Importance"],
    classification = hyper_grid[i, "Classification"]) # otherwise regression is performed
  print(RF_train)
}
iterating over each row of the grid. But for one, I now have to tell the model explicitly whether to perform classification or regression. I assume the factor Species is converted to numeric factor levels, so regression occurs by default. Is there a way to prevent this, and could I also use e.g. apply for this task? This way of iterating also results in messy function calls:
Call:
ranger(dependent.variable.name = hyper_grid[i, "Target"], data = Input_list[[hyper_grid[i, "Input_table"]]], num.trees = hyper_grid[i, "Trees"], importance = hyper_grid[i, "Importance"], classification = hyper_grid[i, "Classification"])
Second: in reality, the output of the model is of course not printed. Instead, I immediately capture the important results (mainly RF_train$confusion.matrix) and write them into an extended version of hyper_grid, on the same row as the input parameters. Is this too costly performance-wise? I ask because if I store the ranger objects themselves, I run into memory issues at some point.
Thank you!
Answer 1:
I think it is cleanest to wrap the training and the extraction of the values you need into a function. The dots (...) are needed for use with the purrr::pmap function below.
fit_and_extract_metrics <- function(Target, Input_table, Trees, Importance, Classification, ...) {
  RF_train <- ranger(
    dependent.variable.name = Target,
    data = Input_list[[Input_table]], # referring to the named object in the list
    num.trees = Trees,
    importance = Importance,
    classification = Classification) # otherwise regression is performed
  # Return only the small summary values, not the fitted model itself
  data.frame(Prediction_error = RF_train$prediction.error,
             True_positive = RF_train$confusion.matrix[1]) # first cell of the confusion matrix
}
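To see why the dots matter, here is a minimal base-R sketch. The functions `with_dots` and `without_dots` are hypothetical toy functions, not part of the code above. purrr::pmap passes every column of the grid as a named argument, including columns the function does not use (such as Repeats), and a function without ... rejects those extra arguments.

```r
# Hypothetical toy functions, for illustration only.
with_dots    <- function(Trees, ...) Trees * 2
without_dots <- function(Trees) Trees * 2

# One grid row, including a column the functions do not use.
grid_row <- list(Trees = 10, Repeats = 1)

# The extra argument Repeats is silently absorbed by `...`.
ok <- do.call(with_dots, grid_row)

# Without `...`, the same call fails with an "unused argument" error;
# the error message is captured here instead of stopping the script.
err <- tryCatch(do.call(without_dots, grid_row),
                error = function(e) conditionMessage(e))
```

Here `ok` holds the computed value, while `err` holds an error message rather than a result.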
Then you can add the results as a column by mapping over the rows, for example with purrr::pmap:
hyper_grid$res <- purrr::pmap(hyper_grid, fit_and_extract_metrics)
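By mapping in this way, the function is applied row by row and returns only a small data frame per row. The fitted ranger object is local to each call and is discarded afterwards, so you should not run into memory issues.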
The result of purrr::pmap is a list, which means that the column res contains a list element (here a one-row data frame) for every row. This can be unnested using tidyr::unnest to spread the elements of that list across your data frame.
tidyr::unnest(hyper_grid, res)
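If you would rather stay in base R, as the question suggests with apply, the same row-wise pattern can be sketched with lapply and do.call. This is only an illustration under stated assumptions: `toy_metrics` is a hypothetical stand-in for fit_and_extract_metrics that fits no model, so the snippet runs without ranger, and the grid is reduced to two columns.

```r
Input_list <- list(iris1 = iris, iris2 = iris)

# Reduced grid for illustration; stringsAsFactors = FALSE keeps names as strings.
hyper_grid <- expand.grid(
  Input_table = names(Input_list),
  Trees = c(10, 20),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for fit_and_extract_metrics(): it returns a one-row
# data frame like the real function would, but does not train anything.
toy_metrics <- function(Input_table, Trees, ...) {
  data.frame(N_rows = nrow(Input_list[[Input_table]]), Trees_used = Trees)
}

# Call the function once per grid row, passing the columns as named arguments.
res_list <- lapply(seq_len(nrow(hyper_grid)), function(i) {
  do.call(toy_metrics, as.list(hyper_grid[i, , drop = FALSE]))
})

# Bind the one-row results together and attach them to the grid.
results <- cbind(hyper_grid, do.call(rbind, res_list))
```

Swapping `toy_metrics` for the real function should give a table comparable to the unnested one, with one row per parameter combination.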
I think this approach is very elegant, but it requires some tidyverse knowledge. I highly recommend this book if you want to know more about that. Chapter 25 (Many models) describes an approach similar to the one I'm taking here.
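One caveat about the grid itself, offered as an aside: expand.grid defaults to stringsAsFactors = TRUE, so character columns such as Target and Input_table become factors. R indexes with a factor's integer codes rather than its labels, which the sketch below demonstrates on iris. Whether this contributes to the regression behaviour described in the question has not been verified against ranger's internals, so treat that link as an assumption. The names `grid_factor` and `grid_char` are illustrative.

```r
# Default behaviour: the character column is stored as a factor.
grid_factor <- expand.grid(Target = "Species")
# With stringsAsFactors = FALSE it stays a plain character vector.
grid_char <- expand.grid(Target = "Species", stringsAsFactors = FALSE)

class_factor <- class(grid_factor$Target)
class_char   <- class(grid_char$Target)

# Indexing by a factor uses its integer code (1 here), so the first column
# of iris is selected; indexing by the string selects the column by name.
col_by_factor <- names(iris[grid_factor[1, "Target"]])
col_by_char   <- names(iris[grid_char[1, "Target"]])
```

Building the grid with stringsAsFactors = FALSE is a low-cost way to make sure names are looked up as names.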
Source: https://stackoverflow.com/questions/60945003/how-to-use-expand-grid-values-to-run-various-model-hyperparameter-combinations-f